The term “AI” is used a lot in discussions of video games, but it typically refers to the logic that controls non-player characters rather than to any system that computer scientists would recognize as AI. Actual applications of AI utilizing artificial neural networks are fairly rare within the video game industry, but, as VentureBeat reports, gaming company Ubisoft has recently published a paper investigating possible uses for an AI agent trained with reinforcement learning.
While entities like DeepMind and OpenAI have investigated how AIs perform in a variety of video games, like StarCraft 2, Dota 2, and Minecraft, very little research has been done on the use of AI under the specific constraints often faced by game developers. Ubisoft La Forge, the prototyping arm of Ubisoft, just recently published a paper detailing an algorithm capable of carrying out predictable actions within a commercial video game. According to the report, the AI algorithms were capable of hitting current benchmarks and performing complex tasks reliably.
The authors of the paper note that while reinforcement learning has been used to great effect in the context of certain video games, often achieving parity with the best human players of said games, the systems created by OpenAI and DeepMind are rarely useful for game developers. The authors note that lack of accessibility is a large issue and that the most impressive results are obtained by research groups with access to large scale computational resources, resources that typically go well beyond what the average game developer has access to. Wrote the researchers:
“These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results … are produced by large research groups with computational resources well beyond what is typically available within video game studios.”
The research team from Ubisoft aimed to remedy some of these problems by creating a reinforcement learning approach that optimized for issues like data sample collection and runtime budget constraints. Ubisoft’s solution was adapted from research done at the University of California, Berkeley. The Soft Actor-Critic algorithm developed by UC Berkeley researchers can train a model that generalizes effectively to new conditions and is much more sample-efficient than most alternatives. The Ubisoft team took this approach and adapted it for both discrete and continuous actions.
The Ubisoft research team evaluated the performance of their algorithm on three different games. There were two soccer games used to test the algorithm, as well as a simple platformer-style game. While the results for these games were slightly worse than the state-of-the-art industry results, another test was conducted in which the algorithms performed much better. The researchers used a driving video game as their test case, having the AI agent follow a given path and negotiate obstacles in an environment the agent hadn’t witnessed during training. There were two continuous actions, steering and acceleration, as well as one binary action (braking).
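To make the mixed action space concrete, here is a minimal Python sketch of sampling one action made of two continuous values and one binary value. It is not taken from the Ubisoft paper: the function name, the tanh squashing of the continuous values, and the parameter names (`mean`, `log_std`, `brake_logit`) are assumptions chosen for illustration.

```python
import math
import random

def sample_hybrid_action(mean, log_std, brake_logit, rng):
    """Sample (steering, acceleration, brake) from a hypothetical hybrid policy head.

    mean, log_std: two-element sequences describing the continuous actions.
    brake_logit: log-odds that the binary brake action is pressed.
    """
    continuous = []
    for m, s in zip(mean, log_std):
        raw = rng.gauss(m, math.exp(s))     # Gaussian sample for a continuous action
        continuous.append(math.tanh(raw))   # squash into the range [-1, 1]
    p_brake = 1.0 / (1.0 + math.exp(-brake_logit))  # sigmoid of the logit
    brake = 1 if rng.random() < p_brake else 0      # Bernoulli sample
    return continuous[0], continuous[1], brake

rng = random.Random(0)
steering, acceleration, brake = sample_hybrid_action(
    mean=[0.0, 0.5], log_std=[-1.0, -1.0], brake_logit=0.0, rng=rng
)
```

In a trained agent, a neural network would produce these distribution parameters from the game state. Here they are fixed by hand.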
The researchers summarized their results in the paper, declaring that the hybrid Soft Actor-Critic approach was successful when training an AI agent to drive at high speeds in a commercially available video game. According to the researchers, their training approach can potentially work for a wide variety of possible interaction approaches. These include instances where the AI agent has the exact same input options that the player has, demonstrating the “practical usefulness of such an algorithm for the video game industry.”
DeepMind and Google Brain Aim to Create Methods to Improve Efficiency of Reinforcement Learning
Reinforcement learning systems can be powerful and robust, able to carry out extremely complex tasks through thousands of iterations of training. While reinforcement learning algorithms are capable of enabling sophisticated and occasionally surprising behavior, they take a long time to train and require vast amounts of data. These factors make reinforcement learning techniques rather inefficient, and research teams from Alphabet’s DeepMind and Google Brain have recently endeavored to find more efficient methods of creating reinforcement learning systems.
As reported by VentureBeat, the combined research group recently proposed methods of making reinforcement learning training more efficient. One of the proposed improvements was an algorithm dubbed Adaptive Behavior Policy Sharing (ABPS), while the other was a framework called Universal Value Function Approximators (UVFA). ABPS lets pools of AI agents share their adaptively selected experiences, while UVFA lets those agents simultaneously investigate directed exploration policies.
ABPS is intended to expedite the tuning of hyperparameters when training a model. It makes finding the optimal hyperparameters quicker by allowing several agents with different hyperparameters to share their behavior policy experiences. To be more precise, under ABPS a reinforcement learning agent selects its actions from those that a policy has deemed acceptable, and it is then granted a reward and an observation based on the following state.
AI reinforcement agents are trained with various combinations of possible hyperparameters, like decay rate and learning rate. When training a model, the goal is that the model converges on the combination of hyperparameters that gives it the best performance, and in this case those that also improve data efficiency. The efficiency is increased by training many agents at one time and choosing the behavior of only one agent to be deployed during the next time step. The policy that the target agent has is used to sample actions. The transitions are then logged within a shared space, and this space is constantly evaluated so that policy selection doesn’t have to occur as often. At the end of the training, an ensemble of agents is chosen and the top performing agents are selected to undergo final deployment.
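The selection-and-sharing loop described above can be sketched in a few lines of Python. This is an illustration only: the epsilon-greedy selection rule, the made-up scores, and the placeholder transitions are assumptions, not the selection mechanism or data format used by the DeepMind and Google Brain teams.

```python
import random

def select_behavior_agent(avg_returns, epsilon, rng):
    """Pick the index of the agent whose policy acts next (epsilon-greedy)."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_returns))   # occasionally try another agent
    return max(range(len(avg_returns)), key=lambda i: avg_returns[i])

# Hypothetical pool: each agent is trained with its own hyperparameters.
agents = [
    {"learning_rate": 1e-3, "decay_rate": 0.99},
    {"learning_rate": 1e-4, "decay_rate": 0.95},
    {"learning_rate": 3e-4, "decay_rate": 0.90},
]
avg_returns = [1.0, 4.0, 2.5]   # made-up running evaluation scores, one per agent
shared_buffer = []              # transitions logged where every agent can read them

rng = random.Random(0)
for step in range(5):
    idx = select_behavior_agent(avg_returns, epsilon=0.1, rng=rng)
    # Placeholder transition; a real one would hold state, action, reward, next state.
    shared_buffer.append({"behavior_agent": idx, "step": step})
    # In a full system, every agent in the pool would now learn from shared_buffer.
```

Only one agent interacts with the environment at each step, yet all agents can learn from the same logged data. That is where the saving in environment samples would come from.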
UVFA attempts to deal with one of the common problems of reinforcement learning: weakly reinforced agents often fail to learn tasks. It tries to solve the issue by having the agent learn a separate set of exploitation and exploration policies at the same time. Separating the tasks creates a framework that allows the exploratory policies to keep exploring the environment while the exploitation policies continue to try to maximize the reward for the current task. The exploratory policies of UVFA serve as a baseline architecture that will continue to improve even if there are no natural rewards being found. In such a condition, a function which corresponds to intrinsic rewards is approximated, which pushes the agents to explore all states in an environment, even if they often return to familiar states.
As VentureBeat explained, when the UVFA framework is in play, the intrinsic rewards of the system are given directly to the agent as inputs. The agent then keeps track of a representation of all inputs (such as rewards, action, and state) during a given episode. The result is that the reward is preserved over time and the agent’s policy is at least somewhat informed by it at all times.
This is accomplished with an “episodic novelty” module and a “life-long novelty” module. The function of the first module is to hold the current, episodic memory and map the current findings to the previously mentioned representation, letting the agent determine an intrinsic episodic reward for every step of training. Afterward, the state linked with the current observation is added into memory. Meanwhile, the life-long novelty module is responsible for influencing how often the agent explores over the course of many episodes.
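As a rough illustration of the episodic part, the Python sketch below keeps a per-episode memory of state embeddings and pays a smaller bonus the more often a similar state has already been seen. The class name, the distance threshold `radius`, and the `1 / sqrt(1 + count)` bonus are assumptions made for this example, not the formula from the research. It needs Python 3.8 or later for `math.dist`.

```python
import math

class EpisodicNovelty:
    """Hypothetical episodic memory that rewards visiting unfamiliar states."""

    def __init__(self, radius=0.5):
        self.radius = radius   # embeddings closer than this count as "the same place"
        self.memory = []

    def reward(self, embedding):
        # Count stored embeddings that lie near the current one.
        near = sum(1 for m in self.memory if math.dist(m, embedding) <= self.radius)
        # Afterward, add the current embedding to the episodic memory.
        self.memory.append(tuple(embedding))
        return 1.0 / math.sqrt(1.0 + near)

    def reset(self):
        """Clear the memory at the start of a new episode."""
        self.memory.clear()

novelty = EpisodicNovelty()
first_visit = novelty.reward((0.0, 0.0))    # nothing similar in memory yet
second_visit = novelty.reward((0.1, 0.0))   # close to the first state, smaller bonus
far_visit = novelty.reward((5.0, 5.0))      # unfamiliar region, full bonus again
```

A life-long module would sit on top of this and scale the bonus across episodes. It is omitted here.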
According to the Alphabet/Google teams, the new training techniques have already demonstrated the potential for substantial improvement while training a reinforcement learning system. UVFA was able to double the performance of some of the base agents that played various Atari games. Meanwhile, ABPS was able to increase performance on some of the same Atari games, decreasing variance amongst the top performing agents by approximately 25%. The UVFA-trained algorithm was able to achieve a high score in Pitfall by itself, without any engineered features or human demos.
DeepMind Discovers AI Training Technique That May Also Work In Our Brains
DeepMind recently published a paper detailing how a newly developed type of reinforcement learning could potentially explain how reward pathways within the human brain operate. As reported by New Scientist, the machine learning training method is called distributional reinforcement learning and the mechanisms behind it seem to plausibly explain how dopamine is released by neurons within the brain.
Neuroscience and computer science have a long history together. As far back as 1951, Marvin Minsky used a system of rewards and punishments to create a computer program capable of solving a maze. Minsky was inspired by the work of Ivan Pavlov, a physiologist who demonstrated that dogs could learn through a series of rewards and punishments. DeepMind’s new paper adds to the intertwining history of neuroscience and computer science by applying a type of reinforcement learning to gain insight into how dopamine neurons might function.
Whenever a person, or animal, is about to carry out an action, the collections of neurons in their brain responsible for the release of dopamine make a prediction about how rewarding the action will be. Once the action has been carried out and the consequences (rewards) of that action made apparent, the brain releases dopamine. However, this dopamine release is scaled in accordance with the size of the error in prediction. If the reward is larger/better than expected, a stronger surge of dopamine is triggered. In contrast, a worse reward leads to less dopamine being released. The dopamine serves as a corrective function that makes the neurons tune their predictions until they converge on the actual rewards being earned. This is very similar to how reinforcement learning algorithms operate.
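The correction loop described above can be written as a one-line update rule: the prediction moves toward the observed reward by a fraction of the prediction error. The Python sketch below is a minimal illustration. The function name, the learning rate of 0.1, and the constant reward of 1.0 are assumptions chosen for the example.

```python
def update_prediction(prediction, reward, learning_rate=0.1):
    """Move a reward prediction toward the observed reward.

    The error plays the role of the dopamine signal: positive when the
    reward is better than expected, negative when it is worse.
    """
    error = reward - prediction
    return prediction + learning_rate * error, error

prediction = 0.0
for _ in range(100):
    # The same reward is delivered every time, so the error shrinks each step.
    prediction, error = update_prediction(prediction, reward=1.0)
```

After enough repetitions the prediction sits very close to the actual reward and the error is nearly zero, which mirrors the convergence the paragraph describes.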
The year 2017 saw DeepMind researchers release an enhanced version of a commonly used reinforcement learning algorithm, and this superior learning method was able to boost performance on many reinforcement learning tasks. The DeepMind team thought that the mechanisms behind the new algorithm could be used to better explain how dopamine neurons operate within the human brain.
In contrast to older reinforcement learning algorithms, DeepMind’s newer algorithm represents rewards as a distribution. Older reinforcement learning approaches represented estimated rewards as just a single number that stood for the average expected result. This change allowed the model to more accurately represent possible rewards and perform better as a result. The superior performance of the new training method prompted the DeepMind researchers to investigate if dopamine neurons in the human brain operate in a similar fashion.
In order to investigate the workings of dopamine neurons, DeepMind worked alongside Harvard to research the activity of dopamine neurons in mice. The researchers had the mice perform various tasks and gave them rewards based on the roll of dice, recording how their dopamine neurons fired. Different neurons seemed to predict different potential results, releasing different amounts of dopamine. Some neurons predicted lower than the actual reward while some predicted rewards higher than the actual reward. After graphing out the distribution of the reward predictions, the researchers found that the distribution of predictions was fairly close to the genuine reward distribution. This suggests that the brain does make use of a distributional system when making predictions and adjusting predictions to better match reality.
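One way to see how a population of predictors can encode a distribution rather than a single average is to give each unit a different sensitivity to positive and negative surprises. The Python sketch below does this for simulated dice rolls. The asymmetry parameter `tau`, the learning rate, and the function name are assumptions made for illustration, and the code is not a reproduction of the experiment or of DeepMind's algorithm.

```python
import random

def train_units(taus, rewards, learning_rate=0.02):
    """Train one reward predictor per asymmetry value in taus.

    A unit with a high tau weights positive errors more heavily and ends up
    optimistic; a unit with a low tau ends up pessimistic.
    """
    predictions = [0.0 for _ in taus]
    for r in rewards:
        for i, tau in enumerate(taus):
            error = r - predictions[i]
            scale = tau if error > 0 else 1.0 - tau   # asymmetric scaling of the error
            predictions[i] += learning_rate * scale * error
    return predictions

rng = random.Random(0)
rewards = [rng.randint(1, 6) for _ in range(20000)]   # rewards set by a die roll
pessimistic, balanced, optimistic = train_units([0.1, 0.5, 0.9], rewards)
```

The balanced unit settles near the average roll, while the other two settle below and above it. Taken together, the spread of predictions carries information about the shape of the reward distribution.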
The study could inform both neuroscience and computer science. The study supports the use of distributional reinforcement learning as a method of creating more advanced AI models. Beyond that, it could have implications for our theories of how the brain operates regarding reward systems. If dopamine neurons are distributed and some are more pessimistic or optimistic than others, understanding these distributions could alter how we approach aspects of psychology like mental health and motivation.
As MIT Technology Review reported, Matt Botvinick, the director of neuroscience research at DeepMind, explained the importance of the findings at a press briefing. Botvinick said:
“If the brain is using it, it’s probably a good idea. It tells us that this is a computational technique that can scale in real-world situations. It’s going to fit well with other computational processes. It gives us a new perspective on what’s going on in our brains during everyday life.”
DeepMind Reports New Method Of Training Reinforcement Learning AI Safely
Reinforcement learning is a promising avenue of AI development, producing AI that can handle extremely complex tasks. Reinforcement learning algorithms are used in the creation of mobile robotics systems and self-driving cars, among other applications. However, due to the way that reinforcement learning agents are trained, they can occasionally manifest bizarre and unexpected behaviors. These behaviors can be dangerous, and AI researchers refer to this problem as the “safe exploration” problem, which is where the AI becomes stuck in the exploration of unsafe states.
Recently, Google’s AI research lab DeepMind released a paper that proposed new methods for dealing with the safe exploration problem and training reinforcement learning AI in a safer fashion. The method suggested by DeepMind also corrects for reward hacking or loopholes in the reward criteria.
DeepMind’s new method has two different systems intended to guide the behavior of the AI in situations where unsafe behavior could arise. The two systems used by DeepMind’s training technique are a generative model and a forward dynamics model. Both of these models are trained on a variety of data, such as demonstrations by safety experts and completely random vehicle trajectories. The data is labeled by a supervisor with specific reward values, and the AI agent will pick up on patterns of behavior that will enable it to collect the greatest reward. The unsafe states have also been labeled, and once the model has managed to successfully predict rewards and unsafe states it is deployed to carry out the targeted actions.
The research team explains in the paper that the idea is to create possible behaviors from scratch, to suggest the desired behaviors, and to have these hypothetical scenarios be as informative as possible while simultaneously avoiding direct interference with the learning environment. The DeepMind team refers to this approach as ReQueST, or reward query synthesis via trajectory optimization.
ReQueST can produce four different types of behavior. The first type tries to maximize uncertainty regarding ensemble reward models. The second and third types attempt to minimize and maximize predicted rewards, respectively. Predicted rewards are minimized in order to lead to the discovery of behaviors that the model may be incorrectly predicting. On the other hand, predicted reward is maximized in order to lead to behavior labels possessing the highest information value. Finally, the fourth type of behavior tries to maximize the novelty of trajectories, so that the model continues to explore regardless of the rewards projected.
Once the model has reached the desired level of reward collection, a planning agent is used to make decisions based on the learned rewards. This model-predictive control scheme lets agents learn to avoid unsafe states by using the dynamic model and predicting possible consequences, in contrast to the behaviors of algorithms that learn through pure trial and error.
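A toy Python sketch can show the general idea of planning with learned models instead of trial and error. Everything in it is a stand-in: the three model functions are hand-written placeholders for what would be learned networks, the one-dimensional world is invented, and exhaustive search over short action sequences replaces whatever trajectory optimizer the DeepMind paper uses.

```python
import itertools

def dynamics(state, action):
    """Hypothetical forward dynamics model: predicts the next state."""
    return state + action

def predicted_reward(state):
    """Hypothetical learned reward model: the goal is at position 3."""
    return -abs(state - 3)

def predicted_unsafe(state):
    """Hypothetical learned classifier: position 1 is labeled unsafe."""
    return state == 1

def plan(state, horizon=4, actions=(-1, 0, 1, 2)):
    """Return the first action of the best imagined trajectory that stays safe."""
    best_return, best_first = None, None
    for seq in itertools.product(actions, repeat=horizon):
        s, total, safe = state, 0.0, True
        for a in seq:
            s = dynamics(s, a)            # imagine the step, do not execute it
            if predicted_unsafe(s):
                safe = False              # discard plans that enter unsafe states
                break
            total += predicted_reward(s)
        if safe and (best_return is None or total > best_return):
            best_return, best_first = total, seq[0]
    return best_first

first_action = plan(0)   # the planner jumps over position 1 instead of stepping on it
```

The agent never has to visit the unsafe position to learn to avoid it, because every candidate trajectory is checked against the models before any action is taken.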
As reported by VentureBeat, the DeepMind researchers believe that their project is the first reinforcement learning system that is capable of learning in a controlled, safe fashion:
“To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states. So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.”