The human brain often recalls past memories (seemingly) unprompted. As we go throughout our day, we have spontaneous flashes of memory from our lives. While this spontaneous conjuration of memories has long been of interest to neuroscientists, AI research company DeepMind recently published a paper detailing how an AI of theirs replicated this strange pattern of recall.
The conjuration of memories in the brain, neural replay, is tightly linked with the hippocampus. The hippocampus is a seahorse-shaped formation in the brain that belongs to the limbic system, and it is associated with the formation of new memories, as well as the emotions that memories spark. Current theories on the role of the hippocampi (there is one in each hemisphere of the brain), state that different regions of the hippocampus are responsible for the handling of different types of memories. For instance, spatial memory is believed to be handled in the rear region of the hippocampus.
As reported by Jesus Rodriguez on Medium, Dr. John O’Keefe is responsible for many contributions to our understanding of the hippocampus, including the hippocampal “place” cells. The place cells in the hippocampus are triggered by stimuli in a specific environment. As an example, experiments on rats showed that specific neurons would fire when the rats ran through certain portions of a track. Researchers continued to monitor the rats even when they were resting, and they found that the same patterns of neurons denoting a portion of the maze would fire, although they fired at an accelerated speed. The rats seemed to be replaying the memories of the maze in their minds.
In humans, recalling memories is an important part of the learning process, but when trying to enable AI to learn, it is difficult to recreate the phenomenon.
The DeepMind team set about trying to recreate the phenomenon of recall using reinforcement learning. Reinforcement learning algorithms work by getting feedback from their interactions with the environment around them, getting rewarded whenever they take actions that bring them closer to the desired goal. In this context, the reinforcement learning agent records events and then plays them back at later times, with the system being reinforced to improve how efficiently it ends up recalling past experiences.
DeepMind added the replaying of experiences to a reinforcement learning algorithm using a replay buffer that would playback memories/recorded experiences to the system at specific times. Some versions of the system had the experiences played back in random orders while other models had pre-selected playback orders. While the researchers experimented with the order of playback for the reinforcement agents, they also experimented with different methods of replaying the experiences themselves.
There are two primary methods that are used to provide reinforcement algorithms with recalled experiences. These methods are the imagination replay method and the movie replay method. The DeepMind paper uses an analogy to describe both of the strategies:
“Suppose you come home and, to your surprise and dismay, discover water pooling on your beautiful wooden floors. Stepping into the dining room, you find a broken vase. Then you hear a whimper, and you glance out the patio door to see your dog looking very guilty.”
As reported by Rodriguez, the imagination replay method doesn’t record the events in the order that they were experienced. Rather, a probable cause between the events is inferred. The events are inferred based on the agent’s understanding of the world. Meanwhile, the movie replay method stores memories in the order in which the events occurred, and replays the sequence of stimuli – “spilled water, broken vase, dog”. The chronological ordering of events is preserved.
Research from the field of neuroscience implies that the movie replay method is integral to the creation of associations between concepts and the connection of neurons between events. Yet the imagination replay method could help the agent create new sequences when it reasons by analogy. For instance, the agent could reason that if a barrel is to oil as a vase is to water, a barrel could be spilled by a factory robot instead of a dog. Indeed, when DeepMind probed further into the possibilities of the imagination replay method, they found that their learning agent was able to create impressive, innovative sequences by taking previous experiences into account.
Most of the current progress being made in the area of reinforcement learning memory is being made with the movie strategy, although researchers have recently begun to make progress with the imagination strategy. Research into both methods of AI memory can not only enable better performance from reinforcement learning agents, but they can also help us gain new insight into how the human mind might function.
DeepMind Discovers AI Training Technique That May Also Work In Our Brains
DeepMind just recently published a paper detailing how a newly developed type of reinforcement learning could potentially explain how reward pathways within the human brain operate. As reported by NewScientist, the machine learning training method is called distributional reinforcement learning and the mechanisms behind it seem to plausibly explain how dopamine is released by neurons within the brain.
Neuroscience and computer science have a long history together. As far back as 1951, Marvin Minksy used a system of rewards and punishments to create a computer program capable of solving a maze. Minksy was inspired by the work of Ivan Pavlov, a physiologist who demonstrated that dogs could learn through a series of rewards and punishments. Deepmind’s new paper adds to the intertwining history of neuroscience and computer science by applying a type of reinforcement learning to gain insight into how dopamine neurons might function.
Whenever a person, or animal, is about to carry out an action, the collections of neurons in their brain responsible for the release of dopamine make a prediction about how rewarding the action will be. Once the action has been carried out and the consequences (rewards) of that action made apparent, the brain releases dopamine. However, this dopamine release is scaled in accordance with the size of the error in prediction. If the reward is larger/better than expected, a stronger surge of dopamine is triggered. In contrast, a worse reward leads to less dopamine being released. The dopamine serves as a corrective function that makes the neurons tune their predictions until they converge on the actual rewards being earned. This is very similar to how reinforcement learning algorithms operate.
The year 2017 saw DeepMind researchers release an enhanced version of a commonly used reinforcement learning algorithm, and this superior learning method was able to boost performance on many reinforcement learning tasks. The DeepMind team thought that the mechanisms behind the new algorithm could be used to better explain how dopamine neurons operate within the human brain.
In contrast to older reinforcement learning algorithms, DeepMind’s newer algorithm represents rewards as a distribution. Older reinforcement learning approaches represented estimated rewards as just a single number that stood for the average expected result. This change allowed the model to more accurately represent possible rewards and perform better as a result. The superior performance of the new training method prompted the DeepMind researchers to investigate if dopamine neurons in the human brain operate in a similar fashion.
In order to investigate the workings of dopamine neurons, DeepMind worked alongside Harvard to research the activity of dopamine neurons in mice. The researchers had the mice perform various tasks and gave them rewards based on the roll of dice, recording how their dopamine neurons fired. Different neurons seemed to predict different potential results, releasing different amounts of dopamine. Some neurons predicted lower than the actual reward while some predicted rewards higher than the actual reward. After graphing out the distribution of the reward predictions, the researchers found that the distribution of predictions was fairly close to the genuine reward distribution. This suggests that the brain does make use of a distributional system when making predictions and adjusting predictions to better match reality.
The study could inform both neuroscience nad computer science. The study supports the use of distributional reinforcement learning as a method of creating more advanced AI models. Beyond that, it could have implications for our theories of how the brain operates regarding reward systems. If dopamine neurons are distributed and some are more pessimistic or optimistic than others, understanding these distributions could alter how we approach aspects of psychology like mental health and motivation.
As MIT Technology View reported, Matt Botvinik, the director of neuroscience research at DeepMind, explained the importance of the findings at a press briefing. Botvinik said:
“If the brain is using it, it’s probably a good idea. It tells us that this is a computational technique that can scale in real-world situations. It’s going to fit well with other computational processes. It gives us a new perspective on what’s going on in our brains during everyday life”
Ubisoft Trains AI Agent To Drive A Car In A Racing Game
The term “AI” is used a lot in discussions of video games, but it is typically used to refer to the logic that controls non-player characters in video games, rather than referring to any system driven by what computer scientists would recognize as AI. Actual applications of AI utilizing artificial neural networks are fairly rare within the video game industry, but as VentureBeat reports gaming company Ubisoft has recently published a paper investigating possible uses for an AI agent trained with reinforcement learning.
While entities like DeepMind and OpenAI have investigated how AIs perform in a variety of video games, like StarCraft 2, Dota 2, and Minecraft, very little research has been done on the use of AI under the specific constraints often faced by game developers. Ubisoft La Forge, the prototyping arm of Ubisoft, just recently published a paper detailing an algorithm capable of carrying out predictable actions within a commercial video game. According to the report, the AI algorithms were capable of hitting current benchmarks and performing complex tasks reliably.
The authors of the paper note that while reinforcement learning has been used to great effect in the context of certain video games, often achieving parity with the best human players of said games, the systems created by OpenAI and DeepMind are rarely useful for game developers. The authors note that lack of accessibility is a large issue and that the most impressive results are obtained by research groups with access to large scale computational resources, resources that typically go well beyond what the average game developer has access to. Wrote the researchers:
“These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results … are produced by large research groups with computational resources well beyond what is typically available within video game studios.”
The research team from Ubisoft aimed to remedy some of these problems by creating a reinforcement learning approach that optimized for issues like data sample collection and runtime budget constraints. Ubisoft’s solution was adapted from research done at the University of California, Berkeley. The Soft Actor-Critic model developed by UC Berkely researches is able to create a model that can effectively generalize to new conditions and is much more sample-efficient than most models. The Ubisoft team took this approach and adapted it for both discrete and continuous actions.
The Ubisoft research team evaluated the performance of their algorithm on three different games. There were two soccer games used to test the algorithm, as well as a simple platformer-style game. While the results for these games was slightly worse than the state-of-the-art industry results, another test was conducted in which the algorithms performed much better. The researchers used a driving video game as their test case, having the AI agent follow a given path and negotiate obstacles in an environment the agent hadn’t witnessed during training. There were two continuous actions, steering and acceleration, as well as one binary action (breaking).
The researchers summarized their results in the paper, declaring that the hybrid Soft Actor-Critic approach was successful when training an AI agent to drive at high speeds in a commercially available video game. According to the researchers, their training approach can potentially work for a wide variety of possible interaction approaches. These include instances where the AI agent has the exact same input options that the player has, demonstrating the “practical usefulness of such an algorithm for the video game industry.”
DeepMind Reports New Method Of Training Reinforcement Learning AI Safely
Reinforcement learning is a promising avenue of AI development, producing AI that can handle extremely complex tasks. Reinforcement AI algorithms are used in the creation of mobile robotics systems and self-driving cars among other applications. However, due to the way that reinforcement AI is trained, they can occasionally manifest bizarre and unexpected behaviors. These behaviors can be dangerous, and AI researchers refer to this problem as the “safe exploration” problem, which is where the AI becomes stuck in the exploration of unsafe states.
Recently, Google’s AI research lab DeepMind released a paper that proposed new methods for dealing with the safe exploration problem and training reinforcement learning AI in a safer fashion. The method suggested by DeepMind also corrects for reward hacking or loopholes in the reward criteria.
DeepMind’s new method has two different systems intended to guide the behavior of the AI in situations where unsafe behavior could arise. The two systems used by DeepMind’s training technique are a generative model and a forward dynamics model. Both of these models are trained on a variety of data, such as demonstrations by safety experts and completely random vehicle trajectories. The data is labeled by a supervisor with specific reward values, and the AI agent will pick up on patterns of behavior that will enable it to collect the greatest reward. The unsafe states have also been labeled, and once the model has managed to successfully predict rewards and unsafe states it is deployed to carry out the targeted actions.
The research team explains in the paper that the idea is to create possible behaviors from scratch, to suggest the desired behaviors, and to have these hypothetical scenarios be as informative as possible while simultaneously avoiding direct interference with the learning environment. The DeepMind team refers to this approach as ReQueST, or reward query synthesis via trajectory optimization.
ReQueST is capable of leading to four different types of behavior. The first type of behavior tries to maximize uncertainty regarding ensemble reward models. Meanwhile, behavior two and three attempts to both minimize and maximize predicted rewards. Predicted rewards are minimized in order to lead to the discovery of behaviors that the model may be incorrectly predicting. On the other hand, predicted reward is maximized in order to lead to behavior labels possessing the highest information value. Finally, the fourth type of behavior tries to maximize the novelty of trajectories, in order that the model continue to explore regardless of the rewards projected.
Once the model has reached the desired level of reward collection, a planning agent is used to make decisions based on the learned rewards. This model-predictive control scheme lets agents learn to avoid unsafe states by using the dynamic model and predicting possible consequences, in contrast to the behaviors of algorithms that learn through pure trial and error.
As reported by VentureBeat, the DeepMind researchers believe that their project is the first reinforcement learning system that is capable of learning in a controlled, safe fashion:
“To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states. So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.”