One of the interesting facts about AI research is that AI systems can often execute actions and pursue strategies that surprise the very researchers designing them. This happened during a recent virtual game of hide and seek where multiple AI agents were pitted against one another. Researchers at OpenAI, an AI firm based out of San Francisco, were surprised to find that their AI agents started exploiting strategies in the game world that the researchers didn’t even know existed.
OpenAI has trained a group of AI agents to play a hide and seek game with each other. The AI programs are trained with reinforcement learning, a technique where the desired behavior is elicited from the AI algorithms by providing the algorithms with feedback. The AI starts out by taking random actions, and every time it takes an action that gets it closer to its goal, the agent is rewarded. The AI is designed to gain the maximum amount of reward possible, so it will experiment to see which actions gain it more reward. Through trial and error, the AI learns to distinguish the strategies that will bring it to victory, namely those that give it the most reward.
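The trial-and-error loop described above can be illustrated with a toy example. What follows is a minimal sketch of an epsilon-greedy learner on a three-option problem, not OpenAI's training setup; the function name `run_bandit`, the reward probabilities, and the exploration rate are all assumptions made for illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions  # running estimate of each action's reward
    counts = [0] * n_actions       # how often each action has been tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The (hypothetical) environment pays 1.0 with some probability.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Move the estimate toward the observed reward (incremental mean).
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.8])
best = max(range(len(counts)), key=lambda a: counts[a])
print("most chosen action:", best)
```

The learner begins with no knowledge of the payoffs, yet ends up choosing the highest-paying action most of the time. The hide and seek agents face a far larger version of the same problem, with many more states and actions.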
Reinforcement learning has already demonstrated impressive success at learning to play games. OpenAI recently trained a team of AI agents to play the MOBA Dota 2, and the AI defeated a world-champion team of human players last year. A similar thing happened with the game StarCraft when an AI was trained on the game by DeepMind. Reinforcement learning has also been used to teach AI programs to play Pictionary with humans, learning to interpret pictures and use basic common sense reasoning.
In the hide and seek video game created by the researchers, multiple AI agents were pitted against one another. The result was an arms race of sorts, where each agent wants to outperform the other and obtain the most reward points. A new strategy adopted by one agent will cause its opponent to seek a new strategy to counter it, and vice-versa. Igor Mordatch, a researcher at OpenAI, explained to IEEE Spectrum that the experiment demonstrates that this process of trial and error playing between agents “is enough for the agents to learn surprising behaviors on their own—it’s like children playing with each other.”
What were the surprising behaviors exactly? The researchers had four basic strategies that they expected the AI agents to learn, and the agents learned these fairly quickly, becoming competent in them after just 25 million simulated games. The game took place in a 3D environment full of ramps, blocks, and walls. The AI agents learned to chase each other around, move blocks to build forts they could hide in, and move ramps around. The AI seekers learned to drag ramps around to get inside the hiders’ forts, while the hiders learned to take the ramps into their forts so the seekers couldn’t use them.
However, around the benchmark of 380 million games, something unexpected happened. The AI agents learned to use two strategies the researchers didn’t expect. The seeker agents learned that by jumping onto a box and tilting/riding the box towards a nearby fort, they could jump into the fort and find the hider. The researchers hadn’t even realized that this was possible within the physics of the game environment. The hiders learned to deal with this issue by dragging the boxes into place within their fort.
While the unexpected behavior of agents trained on reinforcement learning algorithms is harmless in this instance, it does raise some potential concerns about how reinforcement learning is applied to other situations. A member of the OpenAI research team, Bowen Baker, explained to IEEE Spectrum that these unexpected behaviors could be potentially dangerous. After all, what if robots started behaving in unexpected ways?
“Building these environments is hard,” Baker explained. “The agents will come up with these unexpected behaviors, which will be a safety problem down the road when you put them in more complex environments.”
However, Baker also explained that reinforcement strategies could lead to innovative solutions to current problems. Systems trained with reinforcement learning could solve a wide array of problems with solutions we may not be able to even imagine.
DeepMind Discovers AI Training Technique That May Also Work In Our Brains
DeepMind recently published a paper detailing how a newly developed type of reinforcement learning could potentially explain how reward pathways within the human brain operate. As reported by New Scientist, the machine learning training method is called distributional reinforcement learning, and the mechanisms behind it seem to plausibly explain how dopamine is released by neurons within the brain.
Neuroscience and computer science have a long history together. As far back as 1951, Marvin Minsky used a system of rewards and punishments to create a computer program capable of solving a maze. Minsky was inspired by the work of Ivan Pavlov, a physiologist who demonstrated that dogs could learn through a series of rewards and punishments. DeepMind’s new paper adds to the intertwining history of neuroscience and computer science by applying a type of reinforcement learning to gain insight into how dopamine neurons might function.
Whenever a person, or animal, is about to carry out an action, the collections of neurons in their brain responsible for the release of dopamine make a prediction about how rewarding the action will be. Once the action has been carried out and the consequences (rewards) of that action made apparent, the brain releases dopamine. However, this dopamine release is scaled in accordance with the size of the error in prediction. If the reward is larger/better than expected, a stronger surge of dopamine is triggered. In contrast, a worse reward leads to less dopamine being released. The dopamine serves as a corrective function that makes the neurons tune their predictions until they converge on the actual rewards being earned. This is very similar to how reinforcement learning algorithms operate.
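The prediction-and-correction loop described above can be sketched in a few lines. This is a minimal illustration, not a model of real neurons: the function name `learn_prediction`, the learning rate, and the fixed reward of 1.0 are all assumptions chosen for the example.

```python
def learn_prediction(rewards, learning_rate=0.1, initial_prediction=0.0):
    """Nudge a reward prediction toward the rewards actually received."""
    prediction = initial_prediction
    errors = []
    for reward in rewards:
        # Prediction error: positive when the reward beats expectations,
        # negative when it falls short.
        error = reward - prediction
        # The update is scaled by the size of the error.
        prediction += learning_rate * error
        errors.append(error)
    return prediction, errors


# The same reward arrives 100 times in a row.
prediction, errors = learn_prediction([1.0] * 100)
print("final prediction:", round(prediction, 4))
print("first error:", errors[0], "last error:", round(errors[-1], 6))
```

The first error is large because the reward is a complete surprise. As the prediction converges on the actual reward, the error shrinks toward zero, which mirrors the corrective role the article attributes to dopamine.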
The year 2017 saw DeepMind researchers release an enhanced version of a commonly used reinforcement learning algorithm, and this superior learning method was able to boost performance on many reinforcement learning tasks. The DeepMind team thought that the mechanisms behind the new algorithm could be used to better explain how dopamine neurons operate within the human brain.
In contrast to older reinforcement learning algorithms, DeepMind’s newer algorithm represents rewards as a distribution. Older reinforcement learning approaches represented estimated rewards as just a single number that stood for the average expected result. This change allowed the model to more accurately represent possible rewards and perform better as a result. The superior performance of the new training method prompted the DeepMind researchers to investigate if dopamine neurons in the human brain operate in a similar fashion.
In order to investigate the workings of dopamine neurons, DeepMind worked alongside Harvard to research the activity of dopamine neurons in mice. The researchers had the mice perform various tasks and gave them rewards based on the roll of dice, recording how their dopamine neurons fired. Different neurons seemed to predict different potential results, releasing different amounts of dopamine. Some neurons predicted lower than the actual reward while some predicted rewards higher than the actual reward. After graphing out the distribution of the reward predictions, the researchers found that the distribution of predictions was fairly close to the genuine reward distribution. This suggests that the brain does make use of a distributional system when making predictions and adjusting predictions to better match reality.
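One way to picture a population of pessimistic and optimistic predictors is to give each unit its own weighting of good and bad surprises. The sketch below makes that assumption explicit: it is not taken from the paper's code, and the names `train_units` and `optimism_levels`, the learning rate, and the simulated dice rewards are all illustrative choices.

```python
import random


def train_units(rewards, optimism_levels, base_rate=0.02):
    """Train several predictors that weigh good and bad surprises differently.

    A unit with a high optimism level reacts strongly to better-than-expected
    rewards and weakly to worse-than-expected ones, and vice versa.
    """
    predictions = [0.0] * len(optimism_levels)
    for reward in rewards:
        for i, optimism in enumerate(optimism_levels):
            error = reward - predictions[i]
            # Asymmetric learning rate: the assumed source of diversity.
            rate = base_rate * (optimism if error > 0 else 1.0 - optimism)
            predictions[i] += rate * error
    return predictions


rng = random.Random(0)
dice_rolls = [float(rng.randint(1, 6)) for _ in range(20000)]  # rewards 1..6

predictions = train_units(dice_rolls, optimism_levels=[0.1, 0.5, 0.9])
print([round(p, 2) for p in predictions])
```

The balanced unit settles near the average roll of 3.5, while the pessimistic and optimistic units settle below and above it. Taken together, the spread of predictions carries information about the range of possible rewards that a single average cannot.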
The study could inform both neuroscience and computer science. The study supports the use of distributional reinforcement learning as a method of creating more advanced AI models. Beyond that, it could have implications for our theories of how the brain operates regarding reward systems. If dopamine neurons are distributed and some are more pessimistic or optimistic than others, understanding these distributions could alter how we approach aspects of psychology like mental health and motivation.
As MIT Technology Review reported, Matt Botvinick, the director of neuroscience research at DeepMind, explained the importance of the findings at a press briefing. Botvinick said:
“If the brain is using it, it’s probably a good idea. It tells us that this is a computational technique that can scale in real-world situations. It’s going to fit well with other computational processes. It gives us a new perspective on what’s going on in our brains during everyday life.”
Ubisoft Trains AI Agent To Drive A Car In A Racing Game
The term “AI” is used a lot in discussions of video games, but it typically refers to the logic that controls non-player characters rather than to any system driven by what computer scientists would recognize as AI. Actual applications of AI utilizing artificial neural networks are fairly rare within the video game industry, but as VentureBeat reports, gaming company Ubisoft has recently published a paper investigating possible uses for an AI agent trained with reinforcement learning.
While entities like DeepMind and OpenAI have investigated how AIs perform in a variety of video games, like StarCraft 2, Dota 2, and Minecraft, very little research has been done on the use of AI under the specific constraints often faced by game developers. Ubisoft La Forge, the prototyping arm of Ubisoft, just recently published a paper detailing an algorithm capable of carrying out predictable actions within a commercial video game. According to the report, the AI algorithms were capable of hitting current benchmarks and performing complex tasks reliably.
The authors of the paper note that while reinforcement learning has been used to great effect in the context of certain video games, often achieving parity with the best human players of said games, the systems created by OpenAI and DeepMind are rarely useful for game developers. The authors note that lack of accessibility is a large issue and that the most impressive results are obtained by research groups with access to large scale computational resources, resources that typically go well beyond what the average game developer has access to. Wrote the researchers:
“These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results … are produced by large research groups with computational resources well beyond what is typically available within video game studios.”
The research team from Ubisoft aimed to remedy some of these problems by creating a reinforcement learning approach that optimized for issues like data sample collection and runtime budget constraints. Ubisoft’s solution was adapted from research done at the University of California, Berkeley. The Soft Actor-Critic algorithm developed by UC Berkeley researchers can train a model that generalizes effectively to new conditions and is much more sample-efficient than most approaches. The Ubisoft team took this approach and adapted it for both discrete and continuous actions.
The Ubisoft research team evaluated the performance of their algorithm on three different games. There were two soccer games used to test the algorithm, as well as a simple platformer-style game. While the results for these games were slightly worse than the state-of-the-art industry results, another test was conducted in which the algorithms performed much better. The researchers used a driving video game as their test case, having the AI agent follow a given path and negotiate obstacles in an environment the agent hadn’t witnessed during training. There were two continuous actions, steering and acceleration, as well as one binary action (braking).
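To make the mix of action types concrete, here is a minimal sketch of sampling one action from a hybrid action space with two continuous components and one binary component. It is not Ubisoft's implementation: the `DrivingAction` class, the `sample_action` function, and every parameter value are hypothetical, and a trained policy network would normally supply the means and the brake logit.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class DrivingAction:
    steering: float      # continuous, squashed into [-1, 1]
    acceleration: float  # continuous, squashed into [-1, 1]
    brake: bool          # binary: pressed or not


def sample_action(steer_mean, accel_mean, brake_logit, std=0.2, rng=random):
    """Sample continuous actions from Gaussians and the brake from a coin flip."""
    # Continuous part: draw from a Gaussian, then squash with tanh so the
    # value always stays within the valid control range.
    steering = math.tanh(rng.gauss(steer_mean, std))
    acceleration = math.tanh(rng.gauss(accel_mean, std))

    # Discrete part: turn the logit into a probability and flip a biased coin.
    brake_probability = 1.0 / (1.0 + math.exp(-brake_logit))
    brake = rng.random() < brake_probability

    return DrivingAction(steering, acceleration, brake)


rng = random.Random(0)
action = sample_action(steer_mean=0.3, accel_mean=0.8, brake_logit=-2.0, rng=rng)
print(action)
```

Treating the brake as a separate binary choice, rather than forcing it into a continuous range, lets a single policy output both kinds of control at once.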
The researchers summarized their results in the paper, declaring that the hybrid Soft Actor-Critic approach was successful when training an AI agent to drive at high speeds in a commercially available video game. According to the researchers, their training approach can potentially work for a wide variety of possible interaction approaches. These include instances where the AI agent has the exact same input options that the player has, demonstrating the “practical usefulness of such an algorithm for the video game industry.”
DeepMind Reports New Method Of Training Reinforcement Learning AI Safely
Reinforcement learning is a promising avenue of AI development, producing AI that can handle extremely complex tasks. Reinforcement learning algorithms are used in the creation of mobile robotics systems and self-driving cars, among other applications. However, due to the way that reinforcement learning agents are trained, they can occasionally manifest bizarre and unexpected behaviors. These behaviors can be dangerous, and AI researchers refer to this as the “safe exploration” problem, in which the AI becomes stuck in the exploration of unsafe states.
Recently, Google’s AI research lab DeepMind released a paper that proposed new methods for dealing with the safe exploration problem and training reinforcement learning AI in a safer fashion. The method suggested by DeepMind also corrects for reward hacking or loopholes in the reward criteria.
DeepMind’s new method has two different systems intended to guide the behavior of the AI in situations where unsafe behavior could arise. The two systems used by DeepMind’s training technique are a generative model and a forward dynamics model. Both of these models are trained on a variety of data, such as demonstrations by safety experts and completely random vehicle trajectories. The data is labeled by a supervisor with specific reward values, and the AI agent will pick up on patterns of behavior that will enable it to collect the greatest reward. The unsafe states have also been labeled, and once the model has managed to successfully predict rewards and unsafe states it is deployed to carry out the targeted actions.
The research team explains in the paper that the idea is to create possible behaviors from scratch, to suggest the desired behaviors, and to have these hypothetical scenarios be as informative as possible while simultaneously avoiding direct interference with the learning environment. The DeepMind team refers to this approach as ReQueST, or reward query synthesis via trajectory optimization.
ReQueST is capable of producing four different types of behavior. The first type tries to maximize uncertainty regarding ensemble reward models. The second and third types attempt to minimize and maximize predicted rewards, respectively. Predicted rewards are minimized in order to discover behaviors that the model may be predicting incorrectly. Predicted rewards are maximized, on the other hand, in order to obtain behavior labels with the highest information value. Finally, the fourth type tries to maximize the novelty of trajectories, so that the model continues to explore regardless of the projected rewards.
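The first type of behavior, seeking out trajectories on which an ensemble of reward models disagrees, can be sketched as follows. This is a simplified illustration rather than the ReQueST algorithm itself: the three toy reward models, the candidate trajectories, and the use of variance as the measure of uncertainty are all assumptions.

```python
from statistics import pvariance


# Three hypothetical reward models that disagree about how good a state is.
def reward_model_a(state):
    return 1.0 * state


def reward_model_b(state):
    return 2.0 * state


def reward_model_c(state):
    return -1.0 * state


def ensemble_disagreement(trajectory, reward_models):
    """Variance across models of the total reward each predicts."""
    totals = [sum(model(state) for state in trajectory) for model in reward_models]
    return pvariance(totals)


models = [reward_model_a, reward_model_b, reward_model_c]
candidates = [[0, 0, 0], [1, 1, 1], [1, 2, 3]]  # made-up state sequences

# Ask the human supervisor to label the trajectory the models disagree on most.
query = max(candidates, key=lambda t: ensemble_disagreement(t, models))
print("trajectory to label:", query)
```

A label for the most contested trajectory tends to be more informative than a label for one on which every model already agrees.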
Once the model has reached the desired level of reward collection, a planning agent is used to make decisions based on the learned rewards. This model-predictive control scheme lets agents learn to avoid unsafe states by using the dynamics model to predict the possible consequences of their actions, in contrast to algorithms that learn through pure trial and error.
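Planning with a dynamics model can be sketched on a toy problem: an agent on a number line that must reach a goal without stepping on an unsafe position. This is an illustration under stated assumptions, not DeepMind's implementation. The functions `dynamics_model`, `reward_model`, and `is_unsafe` stand in for learned models and are hypothetical, as are the goal, the unsafe position, and the exhaustive search over action sequences.

```python
from itertools import product


def dynamics_model(state, action):
    """Hypothetical learned dynamics: the state is a position on a line."""
    return state + action


def reward_model(state):
    """Hypothetical learned reward: higher the closer to the goal at 5."""
    return -abs(5 - state)


def is_unsafe(state):
    """Hypothetical learned classifier for unsafe states."""
    return state == 3


def plan(state, horizon=4, actions=(-1, 0, 1, 2)):
    """Return the best action sequence whose predicted path stays safe."""
    best_sequence, best_return = None, float("-inf")
    for sequence in product(actions, repeat=horizon):
        current, total, safe = state, 0.0, True
        for action in sequence:
            # Predict the consequence instead of trying it for real.
            current = dynamics_model(current, action)
            if is_unsafe(current):
                safe = False
                break
            total += reward_model(current)
        if safe and total > best_return:
            best_sequence, best_return = sequence, total
    return best_sequence


first_plan = plan(state=0)
print("planned actions:", first_plan)
```

The planner jumps over position 3 rather than walking through it, because every sequence predicted to land there is discarded before anything is executed. In a model-predictive control loop, only the first planned action would be carried out before planning again from the new state.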
As reported by VentureBeat, the DeepMind researchers believe that their project is the first reinforcement learning system that is capable of learning in a controlled, safe fashion:
“To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states. So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.”