If you’ve read about how neural networks are trained, you’ve almost certainly come across the term “gradient descent” before. Gradient descent is the primary method of optimizing a neural network’s performance, reducing the network’s loss/error rate. However, gradient descent can be a little hard to understand for those new to machine learning, and this article will endeavor to give you a decent intuition for how gradient descent operates.
Gradient descent is an optimization algorithm. It’s used to improve the performance of a neural network by making tweaks to the parameters of the network such that the difference between the network’s predictions and the actual/expected values of the network (referred to as the loss) is a small as possible. Gradient descent takes the initial values of the parameters and uses operations based in calculus to adjust their values towards the values that will make the network as accurate as it can be. You don’t need to know a lot of calculus to understand how gradient descent works, but you do need to have an understanding of gradients.
What Are Gradients?
Assume that there is a graph that represents the amount of error a neural network makes. The bottom of the graph represents the points of lowest error while the top of the graph is where the error is the highest. We want to move from the top of the graph down to the bottom. A gradient is just a way of quantifying the relationship between error and the weights of the neural network. The relationship between these two things can be graphed as a slope, with incorrect weights producing more error. The steepness of the slope/gradient represents how fast the model is learning.
A steeper slope means large reductions in error are being made and the model is learning fast, whereas if the slope is zero the model is on a plateau and isn’t learning. We can move down the slope towards less error by calculating a gradient, a direction of movement (change in the parameters of the network) for our model.
Let’s shift the metaphor just slightly and imagine a series of hills and valleys. We want to get to the bottom of the hill and find the part of the valley that represents the lowest loss. When we start at the top of the hill we can take large steps down the hill and be confident that we are heading towards the lowest point in the valley.
However, as we get closer to the lowest point in the valley, our steps will need to become smaller, or else we could overshoot the true lowest point. Similarly, it’s possible that when adjusting the weights of the network, the adjustments can actually take it further away from the point of lowest loss, and therefore the adjustments must get smaller over time. In the context of descending a hill towards a point of lowest loss, the gradient is a vector/instructions detailing the path we should take and how large our steps should be.
Now we know that gradients are instructions that tell us which direction to move in (which coefficients should be updated) and how large the steps we should take are (how much the coefficients should be updated), we can explore how the gradient is calculated.
Calculating Gradients and Gradient Descent Procedure
In order to carry out gradient descent, the gradients must first be calculated. In order to calculate the gradient, we need to know the loss/cost function. We’ll use the cost function to determine the derivative. In calculus, the derivative just refers to the slope of a function at a given point, so we’re basically just calculating the slope of the hill based on the loss function. We determine the loss by running the coefficients through the loss function. If we represent the loss function as “f”, then we can state that the equation for calculating the loss is as follows (we’re just running the coefficients through our chosen cost function):
Loss = f(coefficient)
We then calculate the derivative, or determine the slope. Getting the derivative of the loss will tell us which direction is up or down the slope, by giving us the appropriate sign to adjust our coefficients by. We’ll represent the appropriate direction as “delta”.
delta = derivative_function(loss)
We’ve now determined which direction is downhill towards the point of lowest loss. This means we can update the coefficients in the neural network parameters and hopefully reduce the loss. We’ll update the coefficients based on the previous coefficients minus the appropriate change in value as determined by the direction (delta) and an argument that controls the magnitude of change (the size of our step). The argument that controls the size of the update is called the “learning rate” and we’ll represent it as “alpha”.
coefficient = coefficient – (alpha * delta)
We then just repeat this process until the network has converged around the point of lowest loss, which should be near zero.
It’s very important to choose the right value for the learning rate (alpha). The chosen learning rate must be neither too small or too large. Remember that as we approach the point of lowest loss our steps must become smaller or else we will overshoot the true point of lowest loss and end up on the other side. The point of smallest loss is small and if our rate of change is too large the error can end up increasing again. If the step sizes are too large the network’s performance will continue to bounce around the point of lowest loss, overshooting it on one side and then the other. If this happens the network will never converge on the true optimal weight configuration.
In contrast, if the learning rate is too small the network can potentially take an extraordinarily long time to converge on the optimal weights.
Types Of Gradient Descent
Now that we understand how gradient descent works in general, let’s take a look at some of the different types of gradient descent.
Batch Gradient Descent: This form of gradient descent runs through all the training samples before updating the coefficients. This type of gradient descent is likely to be the most computationally efficient form of gradient descent, as the weights are only updated once the entire batch has been processed, meaning there are fewer updates total. However, if the dataset contains a large number of training examples, then batch gradient descent can make training take a long time.
Stochastic Gradient Descent: In Stochastic Gradient Descent only a single training example is processed for every iteration of gradient descent and parameter updating. This occurs for every training example. Because only one training example is processed before the parameters are updated, it tends to converge faster than Batch Gradient Descent, as updates are made sooner. However, because the process must be carried out on every item in the training set, it can take quite a long time to complete if the dataset is large, and so use of one of the other gradient descent types if preferred.
Mini-Batch Gradient Descent: Mini-Batch Gradient Descent operates by splitting the entire training dataset up into subsections. It creates smaller mini-batches that are run through the network, and when the mini-batch has been used to calculate the error the coefficients are updated. Mini-batch Gradient Descent strikes a middle ground between Stochastic Gradient Descent and Batch Gradient Descent. The model is updated more frequently than in the case of Batch Gradient Descent, which means a slightly faster and more robust convergence on the model’s optimal parameters. It’s also more computationally efficient than Stochastic Gradient Descent
What is Federated Learning?
The traditional method of training AI models involves setting up servers where models are trained on data, often through the use of a cloud-based computing platform. However, over the past few years an alternative form of model creation has arisen, called federated learning. Federated learning brings machine learning models to the data source, rather than bringing the data to the model. Federated learning links together multiple computational devices into a decentralized system that allows the individual devices that collect data to assist in training the model.
In a federated learning system, the various devices that are part of the learning network each have a copy of the model on the device. The different devices/clients train their own copy of the model using the client’s local data, and then the parameters/weights from the individual models are sent to a master device, or server, that aggregates the parameters and updates the global model. This training process can then be repeated until a desired level of accuracy is attained. In short, the idea behind federated learning is that none of the training data is ever transmitted between devices or between parties, only the updates related to the model are.
Federated learning can be broken down into three different steps or phases. Federated learning typically starts with a generic model that acts as a baseline and is trained on a central server. In the first step, this generic model is sent out to the application’s clients. These local copies are then trained on data generated by the client systems, learning and improving their performance.
In the second step, the clients all send their learned model parameters to the central server. This happens periodically, on a set schedule.
In the third step, the server aggregates the learned parameters when it receives them. After the parameters are aggregated, the central model is updated and shared once more with the clients. The entire process then repeats.
The benefit of having a copy of the model on the various devices is that network latencies are reduced or eliminated. The costs associated with sharing data with the server is eliminated as well. Other benefits of federate learning methods include the fact that federated learning models are privacy preserved, and model responses are personalized for the user of the device.
Examples of federated learning models include recommendation engines, fraud detection models, and medical models. Media recommendation engines, of the type used by Netflix or Amazon, could be trained on data gathered from thousands of users. The client devices would train their own separate models and the central model would learn to make better predictions, even though the individual data points would be unique to the different users. Similarly, fraud detection models used by banks can be trained on patterns of activity from many different devices, and a handful of different banks could collaborate to train a common model. In terms of a medical federated learning model, multiple hospitals could team up to train a common model that could recognize potential tumors through medical scans.
Types of Federated Learning
Federated learning schemas typically fall into one of two different classes: multi-party systems and single-party systems. Single-party federated learning systems are called “single-party” because only a single entity is responsible for overseeing the capture and flow of data across all of the client devices in the learning network. The models that exist on the client devices are trained on data with the same structure, though the data points are typically unique to the various users and devices.
In contrast to single-party systems, multi-party systems are managed by two or more entities. These entities cooperate to train a shared model by utilizing the various devices and datasets they have access to. The parameters and data structures are typically similar across the devices belonging to the multiple entities, but they don’t have to be exactly the same. Instead, pre-processing is done to standardize the inputs of the model. A neutral entity might be employed to aggregate the weights established by the devices unique to the different entities.
Common Technologies and Frameworks for Federated Learning
Popular frameworks used for federated learning include Tensorflow Federated, Federated AI Technology Enabler (FATE), and PySyft. PySyft is an open-source federated learning library based on the deep learning library PyTorch. PySyft is intended to ensure private, secure deep learning across servers and agents using encrypted computation. Meanwhile, Tensorflow Federated is another open-source framework built on Google’s Tensorflow platform. In addition to enabling users to create their own algorithms, Tensorflow Federated allows users to simulate a number of included federated learning algorithms on their own models and data. Finally, FATE is also open-source framework designed by Webank AI, and it’s intended to provide the Federated AI ecosystem with a secure computing framework.
Federated Learning Challenges
As federated learning is still fairly nascent, a number of challenges still have to be negotiated in order for it to achieve its full potential. The training capabilities of edge devices, data labeling and standardization, and model convergence are potential roadblocks for federated learning approaches.
The computational abilities of the edge devices, when it comes to local training, need to be considered when designing federated learning approaches. While most smartphones, tablets, and other IoT compatible devices are capable of training machine learning models, this typically hampers the performance of the device. Compromises will have to be made between model accuracy and device performance.
Labeling and standardizing data is another challenge that federated learning systems must overcome. Supervised learning models require training data that is clearly and consistently labeled, which can be difficult to do across the many client devices that are part of the system. For this reason, it’s important to develop model data pipelines that automatically apply labels in a standardized way based on events and user actions.
Model convergence time is another challenge for federated learning, as federated learning models typically take longer to converge than locally trained models. The number of devices involved in the training adds an element of unpredictability to the model training, as connection issues, irregular updates, and even different application use times can contribute to increased convergence time and decreased reliability. For this reason, federated learning solutions are typically most useful when they provide meaningful advantages over centrally training a model, such as instances where datasets are extremely large and distributed.
What is Deep Reinforcement Learning?
Along with unsupervised machine learning and supervised learning, another common form of AI creation is reinforcement learning. Beyond regular reinforcement learning, deep reinforcement learning can lead to astonishingly impressive results, thanks to the fact that it combines the best aspects of both deep learning and reinforcement learning. Let’s take a look at precisely how deep reinforcement learning operates. Note that this article won’t delve too deeply into the formulas used in deep reinforcement learning, rather it aims to give the reader a high level intution for how the process works.
Before we dive into deep reinforcement learning, it might be a good idea to refresh ourselves on how regular reinforcement learning works. In reinforcement learning, goal-oriented algorithms are designed through a process of trial and error, optimizing for the action that leads to the best result/the action that gains the most “reward”. When reinforcement learning algorithms are trained, they are given “rewards” or “punishments” that influence which actions they will take in the future. Algorithms try to find a set of actions that will provide the system with the most reward, balancing both immediate and future rewards.
Reinforcement learning algorithms are very powerful because they can be applied to almost any task, being able to flexibly and dynamically learn from an environment and discover possible actions.
Overview of Deep Reinforcement Learning
When it comes to deep reinforcement learning, the environment is typically represented with images. An image is a capture of the environment at a particular point in time. The agent must analyze the images and extract relevant information from them, using the information to inform which action they should take. Deep reinforcement learning is typically carried out with one of two different techniques: value-based learning and policy-based learning.
Value-based learning techniques make use of algorithms and architectures like convolutional neural networks and Deep-Q-Networks. These algorithms operate by converting the image to greyscale and cropping out unnecessary parts of the image. Afterward, the image undergoes various convolutions and pooling operations, extracting the most relevant portions of the image. The important parts of the image are then used to calculate the Q-value for the different actions the agent can take. Q-values are used to determine the best course of action for the agent. After the initial Q-values are calculated, backpropagation is carried out in order that the most accurate Q-values can be determined.
Policy-based methods are used when the number of possible actions that the agent can take is extremely high, which is typically the case in real-world scenarios. Situations like these require a different approach because calculating the Q-values for all the individual actions isn’t pragmatic. Policy-based approaches operate without calculating function values for individual actions. Instead, they adopt policies by learning the policy directly, often through techniques called Policy Gradients.
Policy gradients operate by receiving a state and calculating probabilities for actions based on the agent’s prior experiences. The most probable action is then selected. This process is repeated until the end of the evaluation period and the rewards are given to the agent. After the rewards have been dealt with the agent, the network’s parameters are updated with backpropagation.
A Closer Look at Q-Learning
Because Q-Learning is such a large part of the deep reinforcement learning process, let’s take some time to really understand how the Q-learning system works.
The Markov Decision Process
In order for an AI agent to carry out a series of tasks and reach a goal, the agent must be able to deal with a sequence of states and events. The agent will begin at one state and it must take a series of actions to reach an end state, and there can be a massive number of states existing between the beginning and end states. Storing information regarding every state is impractical or impossible, so the system must find a way to preserve just the most relevant state information. This is accomplished through the use of a Markov Decision Process, which preserves just the information regarding the current state and the previous state. Every state follows a Markov property, which tracks how the agent change from the previous state to the current state.
Once the model has access to information about the states of the learning environment, Q-values can be calculated. The Q-values are the total reward given to the agent at the end of a sequence of actions.
The Q-values are calculated with a series of rewards. There is an immediate reward, calculated at the current state and depending on the current action. The Q-value for the subsequent state is also calculated, along with the Q-value for the state after that, and so on until all the Q-values for the different states have been calculated. There is also a Gamma parameter that is used to control how much weight future rewards have on the agent’s actions. Policies are typically calculated by randomly initializing Q-values and letting the model converge toward the optimal Q-values over the course of training.
One of the fundamental problems involving the use of Q-learning for reinforcement learning is that the amount of memory required to store data rapidly expands as the number of states increases. Deep Q Networks solve this problem by combining neural network models with Q-values, enabling an agent to learn from experience and make reasonable guesses about the best actions to take. With deep Q-learning, the Q-value functions are estimated with neural networks. The neural network takes the state in as the input data, and the network outputs Q-value for all the different possible actions the agent might take.
Deep Q-learning is accomplished by storing all the past experiences in memory, calculating maximum outputs for the Q-network, and then using a loss function to calculate the difference between current values and the theoretical highest possible values.
Deep Reinforcement Learning vs Deep Learning
One important difference between deep reinforcement learning and regular deep learning is that in the case of the former the inputs are constantly changing, which isn’t the case in traditional deep learning. How can the learning model account for inputs and outputs that are constantly shifting?
Essentially, to account for the divergence between predicted values and target values, two neural networks can be used instead of one. One network estimates the target values, while the other network is responsible for the predictions. The parameters of the target network are updated as the model learns, after a chosen number of training iterations have passed. The outputs of the respective networks are then joined together to determine the difference.
Policy-based learning approaches operate differently than Q-value based approaches. While Q-value approaches create a value function that predicts rewards for states and actions, policy-based methods determine a policy that will map states to actions. In other words, the policy function that selects for actions is directly optimized without regard to the value function.
A policy for deep reinforcement learning falls into one of two categories: stochastic or deterministic. A deterministic policy is one where states are mapped to actions, meaning that when the policy is given information about a state an action is returned. Meanwhile, stochastic policies return a probability distribution for actions instead of a single, discrete action.
Deterministic policies are used when there is no uncertainty about the outcomes of the actions that can be taken. In other words, when the environment itself is deterministic. In contrast, stochastic policy outputs are appropriate for environments where the outcome of actions is uncertain. Typically, reinforcement learning scenarios involve some degree of uncertainty so stochastic policies are used.
Policy gradient approaches have a few advantages over Q-learning approaches, as well as some disadvantages. In terms of advantages, policy-based methods converge on optimal parameters quicker and more reliably. The policy gradient can just be followed until the best parameters are determined, whereas with value-based methods small changes in estimated action values can lead to large changes in actions and their associated parameters.
Policy gradients work better for high dimensional action spaces as well. When there is an extremely high number of possible actions to take, deep Q-learning becomes impractical because it must assign a score to every possible action for all time steps, which may be impossible computationally. However, with policy-based methods, the parameters are adjusted over time and the number of possible best parameters quickly shrinks as the model converges.
Policy gradients are also capable of implementing stochastic policies, unlike value-based policies. Because stochastic policies produce a probability distribution, an exploration/exploitation trade-off does not need to be implemented.
In terms of disadvantages, the main disadvantage of policy gradients is that they can get stuck while searching for optimal parameters, focusing only on a narrow, local set of optimum values instead of the global optimum values.
Policy Score Function
The policies used to optimize a model’s performance aim to maximize a score function – J(θ). If J(θ) is a measure of how good our policy is for achieving the desired goal, we can find the values of “θ” that gives us the best policy. First, we need to calculate an expected policy reward. We estimate the policy reward so we have an objective, something to optimize towards. The Policy Score Function is how we calculate the expected policy reward, and there are different Policy Score Functions that are commonly used, such as: start values for episodic environments, the average value for continuous environments, and the average reward per time step.
Policy Gradient Ascent
After the desired Policy Score Function is used, and an expected policy reward calculated, we can find a value for the parameter “θ” which maximizes the score function. In order to maximize the score function J(θ), a technique called “gradient ascent” is used. Gradient ascent is similar in concept to gradient descent in deep learning, but we are optimizing for the steepest increase instead of decrease. This is because our score is not “error”, like in many deep learning problems. Our score is something we want to maximize. An expression called the Policy Gradient Theorem is used to estimate the gradient with respect to policy “θ”.
In summary, deep reinforcement learning combines aspects of reinforcement learning and deep neural networks. Deep reinforcement learning is done with two different techniques: Deep Q-learning and policy gradients.
Deep Q-learning methods aim to predict which rewards will follow certain actions taken in a given state, while policy gradient approaches aim to optimize the action space, predicting the actions themselves. Policy-based approaches to deep reinforcement learning are either deterministic or stochastic in nature. Deterministic policies map states directly to actions while stochastic policies produce probability distributions for actions.
What is Bayes Theorem?
If you’ve been learning about data science or machine learning, there’s a good chance you’ve heard the term “Bayes Theorem” before, or a “Bayes classifier”. These concepts can be somewhat confusing, especially if you aren’t used to thinking of probability from a traditional, frequentist statistics perspective. This article will attempt to explain the principles behind Bayes Theorem and how it’s used in machine learning.
Defining Bayes Theorem
Bayes Theorem is a method of calculating conditional probability. The traditional method of calculating conditional probability (the probability that one event occurs given the occurrence of a different event) is to use the conditional probability formula, calculating the joint probability of event one and event two occurring at the same time, and then dividing it by the probability of event two occurring. However, conditional probability can also be calculated in a slightly different fashion by using Bayes Theorem.
When calculating conditional probability with Bayes theorem, you use the following steps:
- Determine the probability of condition B being true, assuming that condition A is true.
- Determine the probability of event A being true.
- Multiply the two probabilities together.
- Divide by the probability of event B occurring.
This means that the formula for Bayes Theorem could be expressed like this:
P(A|B) = P(B|A)*P(A) / P(B)
Calculating the conditional probability like this is especially useful when the reverse conditional probability can be easily calculated, or when calculating the joint probability would be too challenging.
A Practical Example
This might be easier to interpret if we spend some time looking at an example of how you would apply Bayesian reasoning and Bayes Theorem. Let’s assume you were playing a simple game where multiple participants tell you a story and you have to determine which one of the participants is lying to you. Let’s fill in the equation for Bayes Theorem with the variables in this hypothetical scenario.
We’re trying to predict whether each individual in the game is lying or telling the truth, so if there are three players apart from you, the categorical variables can be expressed as A1, A2, and A3. The evidence for their lies/truth is their behavior. Like when playing poker, you would look for certain “tells” that a person is lying and use those as bits of information to inform your guess. Or if you were allowed to question them it would be any evidence their story doesn’t add up. We can represent the evidence that a person is lying as B.
To be clear, we’re aiming to predict Probability(A is lying/telling the truth|given the evidence of their behavior). To do this we’d want to figure out the probability of B given A, or the probability that their behavior would occur given the person genuinely lying or telling the truth. You’re trying to determine under which conditions the behavior you are seeing would make the most sense. If there are three behaviors you are witnessing, you would do the calculation for each behavior. For example, P(B1, B2, B3 * A). You would then do this for every occurrence of A/for every person in the game aside from yourself. That’s this part of the equation above:
P(B1, B2, B3,|A) * P|A
Finally, we just divide that by the probability of B.
If we received any evidence about the actual probabilities in this equation, we would recreate our probability model, taking the new evidence into account. This is called updating your priors, as you update your assumptions about the prior probability of the observed events occurring.
Machine Learning Applications
The most common use of Bayes theorem when it comes to machine learning is in the form of the Naive Bayes algorithm.
Naive Bayes is used for the classification of both binary and multi-class datasets, Naive Bayes gets its name because the values assigned to the witnesses evidence/attributes – Bs in P(B1, B2, B3 * A) – are assumed to be independent of one another. It’s assumed that these attributes don’t impact each other in order to simplify the model and make calculations possible, instead of attempting the complex task of calculating the relationships between each of the attributes. Despite this simplified model, Naive Bayes tends to perform quite well as a classification algorithm, even when this assumption probably isn’t true (which is most of the time).
There are also commonly used variants of the Naive Bayes classifier such as Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes.
Multinomial Naive Bayes algorithms are often used to classify documents, as it is effective at interpreting the frequency of words within a document.
Bernoulli Naive Bayes operates similarly to Multinomial Naive Bayes, but the predictions rendered by the algorithm are booleans. This means that when predicting a class the values will be binary, no or yes. In the domain of text classification, a Bernoulli Naive Bayes algorithm would assign the parameters a yes or no based on whether or not a word is found within the text document.
If the value of the predictors/features aren’t discrete but are instead continuous, Gaussian Naive Bayes can be used. It’s assumed that the values the continuous features have been sampled from a gaussian distribution.
- Microsoft to Replace Dozens of Journalists With AI
- AI Model Might Let Game Developers Generate Lifelike Animations
- Akilesh Bapu, Founder & CEO of DeepScribe – Interview Series
- AI Models Trained On Sex Biased Data Perform Worse At Diagnosing Disease
- Stefano Pacifico, and David Heeger, Co-Founders of Epistemic AI – Interview Series