### AI 101

# Supervised vs Unsupervised Learning

In machine learning, most tasks can be easily categorized into one of two different classes: supervised learning problems or unsupervised learning problems. In supervised learning, data has labels or classes appended to it, while in the case of unsupervised learning the data is unlabeled. Let’s take a close look at why this distinction is important and look at some of the algorithms associated with each type of learning.

## Supervised Vs. Unsupervised Learning

Most machine learning tasks are in the domain of supervised learning. In supervised learning algorithms, the individual instances/data points in the dataset have a class or label assigned to them. This means that the machine learning model can learn which features are correlated with a given class, and that the machine learning engineer can check the model’s performance by seeing how many instances were properly classified. Classification algorithms can be used to discern many complex patterns, as long as the data is labeled with the proper classes. For instance, a machine learning algorithm can learn to distinguish different animals from each other based on characteristics like “whiskers”, “tail”, “claws”, etc.

In contrast to supervised learning, unsupervised learning involves creating a model that is able to extract patterns from unlabeled data. In other words, the computer analyzes the input features and determines for itself what the most important features and patterns are. Unsupervised learning tries to find the inherent similarities between different instances. If a supervised learning algorithm aims to place data points into known classes, unsupervised learning algorithms will examine the features common to the object instances and place them into groups based on these features, essentially creating its own classes.

Examples of supervised learning algorithms are Linear Regression, Logistic Regression, K-nearest Neighbors, Decision Trees, and Support Vector Machines.

Meanwhile, some examples of unsupervised learning algorithms are Principal Component Analysis and K-Means Clustering.

## Supervised Learning Algorithm Examples

Linear Regression is an algorithm that models the relationship between a numerical input feature and a numerical target by fitting a straight line to the data. Linear Regression is used to predict numerical values in relation to other numerical variables. With a single input feature, Linear Regression has the equation Y = a + bX, where b is the line’s slope and a is the intercept, the value of Y where the line crosses the Y-axis (that is, where X = 0).
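As a rough illustration of the equation Y = a + bX, the sketch below estimates a and b with ordinary least squares. The function name `fit_line` and the toy data are assumptions made for this example, and least squares is only one common way of choosing the line.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a and slope b in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical toy data that lies exactly on Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
prediction = a + b * 4.0  # predicted Y for a new X of 4
print(a, b, prediction)
```

Once a and b are known, predicting a new value is just a matter of plugging X into the equation.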

Logistic Regression is a binary classification algorithm. The algorithm examines the relationship between numerical features and finds the probability that the instance belongs to one of two different classes. The probability values are “squeezed” into the range between 0 and 1. In other words, strong probabilities will approach 1 while weak probabilities will approach 0.
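The “squeezing” is typically done with the sigmoid (logistic) function. The sketch below shows how a weighted sum of features is mapped to a probability and then to a class. The weights, bias, feature values, and the 0.5 threshold are hypothetical values chosen for illustration, not learned from data.

```python
import math


def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one instance."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical, hand-picked parameters (a real model would learn these).
weights = [1.5, -0.5]
bias = 0.0

p_strong, label_strong = predict_class([4.0, 0.0], weights, bias)   # z = 6
p_weak, label_weak = predict_class([-4.0, 0.0], weights, bias)      # z = -6
print(p_strong, label_strong, p_weak, label_weak)
```

Large positive inputs give probabilities close to 1, and large negative inputs give probabilities close to 0.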

K-Nearest Neighbors assigns a class to new data points based on the assigned classes of some chosen number of neighbors in the training set. The number of neighbors considered by the algorithm is important: choosing too few or too many neighbors can cause points to be misclassified.
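A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote (both are common choices, but not the only ones). The function name `knn_predict` and the toy points are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's distance to the query with its label, nearest first.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two "cat" points close together, five "dog" points far away.
points = [(1.0, 1.0), (1.0, 2.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0), (7.0, 7.0), (8.0, 8.0)]
labels = ["cat", "cat", "dog", "dog", "dog", "dog", "dog"]
query = (1.5, 1.5)  # sits right next to the cats

small_k = knn_predict(points, labels, query, k=3)
large_k = knn_predict(points, labels, query, k=7)
print(small_k, large_k)
```

With k=3 the two nearby cats win the vote. With k=7 every training point votes, so the more numerous dogs outvote them even though they are far away, which is one way that too many neighbors can misclassify a point.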

Decision Trees are a type of classification and regression algorithm. A decision tree operates by splitting a dataset into smaller and smaller portions until the subsets can’t be split any further. What results is a tree with nodes and leaves. The nodes are where decisions about data points are made using different filtering criteria, while the leaves are the instances that have been assigned some label (a data point that has been classified). Decision tree algorithms are capable of handling both numerical and categorical data. Splits are made in the tree on specific variables/features.

Support Vector Machines are a classification algorithm that operates by drawing hyperplanes, or lines of separation, between data points. Data points are separated into classes based upon which side of the hyperplane they are on. Multiple hyperplanes can be drawn across a plane, dividing a dataset into multiple classes. The classifier will try to maximize the distance between the dividing hyperplane and the points on either side of the plane, and the greater the distance between the line and the points, the more confident the classifier is.

## Unsupervised Learning Algorithms

Principal Component Analysis is a technique used for dimensionality reduction, meaning that the dimensionality or complexity of the data is represented in a simpler fashion. The Principal Component Analysis algorithm finds new dimensions for the data that are orthogonal to one another. While the dimensionality of the data is reduced, the variance in the data should be preserved as much as possible. What this means in practical terms is that it takes the features in the dataset and distills them down into fewer features that represent most of the data.
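To make this concrete, the sketch below reduces two-dimensional points to one dimension and checks how much variance survives. It finds only the first principal component, using power iteration on the covariance matrix. That choice is an assumption made to keep the example short; library implementations usually rely on a full eigendecomposition or SVD. The function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, projected values, covariance matrix) for the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiplying by the covariance matrix pulls the
    # vector towards the direction of greatest variance.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    # Each point is now described by a single number: its position along the direction.
    projected = [sum(r[j] * direction[j] for j in range(d)) for r in centered]
    return direction, projected, cov


# Hypothetical data: two features that rise together, so one dimension captures most of it.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

direction, projected, cov = first_principal_component(rows)
total_variance = sum(cov[j][j] for j in range(len(cov)))
kept_variance = sum(p * p for p in projected) / (len(projected) - 1)
explained_ratio = kept_variance / total_variance
print(direction, explained_ratio)
```

Here a single new feature retains nearly all of the variance of the original two, which is the sense in which fewer features can “represent most of the data”.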

K-Means Clustering is an algorithm that automatically groups data points into clusters based on similar features. The patterns within the dataset are analyzed and the data points are split into groups based on these patterns. Essentially, K-means creates its own classes out of unlabeled data. The K-Means algorithm operates by assigning centers to the clusters, or centroids, and moving the centroids until the optimal position for the centroids is found. The optimal position will be one where the distance between each centroid and the surrounding data points within its cluster is minimized. The “K” in K-means clustering refers to how many centroids have been chosen.
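The assign-then-move loop can be sketched as follows. The starting centroids are passed in by hand to keep the example deterministic (real implementations usually pick them randomly or with a smarter initialization), and the function name `k_means`, the fixed iteration count, and the data are assumptions for illustration.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups; K = 2 starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the two groups emerge purely from the distances between the points.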

## Summing Up

To close, let’s quickly go over the key differences between supervised and unsupervised learning.

As we previously discussed, in supervised learning tasks the input data is labeled and the number of classes is known. Meanwhile, in unsupervised learning the input data is unlabeled and the number of classes is not known. Unsupervised learning tends to be less computationally complex, whereas supervised learning tends to be more computationally complex. While supervised learning results tend to be highly accurate, unsupervised learning results tend to be only moderately accurate.

## To Learn More

| Recommended Machine Learning Courses | Offered By | Duration | Difficulty |
|---|---|---|---|
| | IBM | 9 Hours | Beginner |
| | Yonsei University | 8 Hours | Beginner |
| | Intel Software | 12 Hours | Intermediate |
| | University of Washington | 24 Hours | Intermediate |


# What is Gradient Descent?

If you’ve read about how neural networks are trained, you’ve almost certainly come across the term “gradient descent” before. Gradient descent is the primary method of optimizing a neural network’s performance, reducing the network’s loss/error rate. However, gradient descent can be a little hard to understand for those new to machine learning, and this article will endeavor to give you a decent intuition for how gradient descent operates.

Gradient descent is an optimization algorithm. It’s used to improve the performance of a neural network by making tweaks to the parameters of the network such that the difference between the network’s predictions and the actual/expected values (referred to as the loss) is as small as possible. Gradient descent takes the initial values of the parameters and uses operations based in calculus to adjust their values towards the values that will make the network as accurate as it can be. You don’t need to know a lot of calculus to understand how gradient descent works, but you do need to have an understanding of gradients.

## What Are Gradients?

Assume that there is a graph that represents the amount of error a neural network makes. The bottom of the graph represents the points of lowest error while the top of the graph is where the error is the highest. We want to move from the top of the graph down to the bottom. A gradient is just a way of quantifying the relationship between error and the weights of the neural network. The relationship between these two things can be graphed as a slope, with incorrect weights producing more error. The steepness of the slope/gradient represents how fast the model is learning.

A steeper slope means large reductions in error are being made and the model is learning fast, whereas if the slope is zero the model is on a plateau and isn’t learning. We can move down the slope towards less error by calculating a gradient, a direction of movement (change in the parameters of the network) for our model.

Let’s shift the metaphor just slightly and imagine a series of hills and valleys. We want to get to the bottom of the hill and find the part of the valley that represents the lowest loss. When we start at the top of the hill we can take large steps down the hill and be confident that we are heading towards the lowest point in the valley.

However, as we get closer to the lowest point in the valley, our steps will need to become smaller, or else we could overshoot the true lowest point. Similarly, it’s possible that when adjusting the weights of the network, the adjustments can actually take it further away from the point of lowest loss, and therefore the adjustments must get smaller over time. In the context of descending a hill towards a point of lowest loss, the gradient is a vector/instructions detailing the path we should take and how large our steps should be.

Now that we know that gradients are instructions that tell us which direction to move in (which coefficients should be updated) and how large our steps should be (how much the coefficients should be updated), we can explore how the gradient is calculated.

## Calculating Gradients and Gradient Descent Procedure

In order to carry out gradient descent, the gradients must first be calculated. In order to calculate the gradient, we need to know the loss/cost function. We’ll use the cost function to determine the derivative. In calculus, the derivative just refers to the slope of a function at a given point, so we’re basically just calculating the slope of the hill based on the loss function. We determine the loss by running the coefficients through the loss function. If we represent the loss function as “f”, then we can state that the equation for calculating the loss is as follows (we’re just running the coefficients through our chosen cost function):

Loss = f(coefficient)

We then calculate the derivative, or determine the slope. Getting the derivative of the loss will tell us which direction is up or down the slope, by giving us the appropriate sign to adjust our coefficients by. We’ll represent the appropriate direction as “delta”.

delta = derivative_function(loss)

We’ve now determined which direction is downhill towards the point of lowest loss. This means we can update the coefficients in the neural network parameters and hopefully reduce the loss. We’ll update the coefficients based on the previous coefficients minus the appropriate change in value as determined by the direction (delta) and an argument that controls the magnitude of change (the size of our step). The argument that controls the size of the update is called the “learning rate” and we’ll represent it as “alpha”.

coefficient = coefficient – (alpha * delta)

We then just repeat this process until the network has converged around the point of lowest loss, which should be near zero.
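Put together, the steps above can be sketched in a few lines. The loss function here, a simple quadratic whose lowest point is at a coefficient of 3, is an assumption chosen so the result is easy to check; a real network has many coefficients and a far more complicated loss. The names `loss`, `derivative`, and `gradient_descent` are likewise made up for this example.

```python
def loss(coefficient):
    """Hypothetical loss: zero when the coefficient equals 3, larger everywhere else."""
    return (coefficient - 3.0) ** 2


def derivative(coefficient):
    """Slope of the loss at the given coefficient (derivative of (c - 3)^2)."""
    return 2.0 * (coefficient - 3.0)


def gradient_descent(coefficient, alpha, steps):
    for _ in range(steps):
        delta = derivative(coefficient)            # which way is uphill, and how steep
        coefficient = coefficient - alpha * delta  # step downhill, scaled by the learning rate
    return coefficient


final = gradient_descent(coefficient=0.0, alpha=0.1, steps=100)
print(final, loss(final))
```

Starting from 0, the coefficient moves steadily towards 3 and the loss shrinks towards zero.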

It’s very important to choose the right value for the learning rate (alpha). The chosen learning rate must be neither too small nor too large. Remember that as we approach the point of lowest loss our steps must become smaller, or else we will overshoot the true point of lowest loss and end up on the other side. The point of smallest loss is small, and if our rate of change is too large the error can end up increasing again. If the step sizes are too large, the network’s performance will continue to bounce around the point of lowest loss, overshooting it on one side and then the other. If this happens, the network will never converge on the true optimal weight configuration.

In contrast, if the learning rate is too small the network can potentially take an extraordinarily long time to converge on the optimal weights.
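The effect of the learning rate can be seen on a toy problem. The sketch below minimizes an assumed quadratic loss, (c - 3)^2, with three different values of alpha. The specific values are illustrative only; what counts as too large or too small depends entirely on the loss being minimized.

```python
def descend(alpha, steps=100, start=0.0):
    """Run gradient descent on the hypothetical loss (c - 3)^2 and return the final c."""
    coefficient = start
    for _ in range(steps):
        delta = 2.0 * (coefficient - 3.0)  # derivative of the loss
        coefficient = coefficient - alpha * delta
    return coefficient


well_chosen = descend(alpha=0.1)    # settles very close to 3
too_large = descend(alpha=1.0)      # jumps between 0 and 6, never settling
too_small = descend(alpha=0.001)    # heads the right way, but is still far from 3
print(well_chosen, too_large, too_small)
```

With alpha set to 1.0 every update overshoots the lowest point by exactly as much as it was off before, so the coefficient bounces from one side to the other indefinitely. With alpha set to 0.001 the updates point in the right direction but many more steps would be needed to get close.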

## Types Of Gradient Descent

Now that we understand how gradient descent works in general, let’s take a look at some of the different types of gradient descent.

Batch Gradient Descent: This form of gradient descent runs through all the training samples before updating the coefficients. This type of gradient descent is likely to be the most computationally efficient form of gradient descent, as the weights are only updated once the entire batch has been processed, meaning there are fewer updates total. However, if the dataset contains a large number of training examples, then batch gradient descent can make training take a long time.

Stochastic Gradient Descent: In Stochastic Gradient Descent only a single training example is processed for every iteration of gradient descent and parameter updating. This occurs for every training example. Because only one training example is processed before the parameters are updated, it tends to converge faster than Batch Gradient Descent, as updates are made sooner. However, because the process must be carried out on every item in the training set, it can take quite a long time to complete if the dataset is large, and so use of one of the other gradient descent types is often preferred.

Mini-Batch Gradient Descent: Mini-Batch Gradient Descent operates by splitting the entire training dataset up into subsections. It creates smaller mini-batches that are run through the network, and when the mini-batch has been used to calculate the error the coefficients are updated. Mini-Batch Gradient Descent strikes a middle ground between Stochastic Gradient Descent and Batch Gradient Descent. The model is updated more frequently than in the case of Batch Gradient Descent, which means a slightly faster and more robust convergence on the model’s optimal parameters. It’s also more computationally efficient than Stochastic Gradient Descent.
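The three variants differ only in how many training examples are used per update, so a single function with a `batch_size` parameter can sketch all of them. The model (a single weight w in y = w * x), the mean-squared-error gradient, the data, and the hyperparameters are all assumptions made for this example.

```python
import random


def train(xs, ys, batch_size, epochs=50, alpha=0.01, seed=0):
    """Fit y = w * x by gradient descent; return the learned w and the number of updates."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a different order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over just this batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
            updates += 1
    return w, updates


# Hypothetical noise-free data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

w_batch, n_batch = train(xs, ys, batch_size=len(xs))  # batch: one update per epoch
w_sgd, n_sgd = train(xs, ys, batch_size=1)            # stochastic: one update per example
w_mini, n_mini = train(xs, ys, batch_size=4)          # mini-batch: somewhere in between
print(n_batch, n_sgd, n_mini)
```

On this tiny, noise-free dataset all three settings recover the same weight; the difference lies in how many updates each performs over the same number of passes through the data.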


# What is Backpropagation?

Deep learning systems are able to learn extremely complex patterns, and they accomplish this by adjusting their weights. How are the weights of a deep neural network adjusted exactly? They are adjusted through a process called backpropagation. Without backpropagation, deep neural networks wouldn’t be able to carry out tasks like recognizing images and interpreting natural language. Understanding how backpropagation works is critical to understanding deep neural networks in general, so let’s delve into backpropagation and see how the process is used to adjust a network’s weights.

Backpropagation can be difficult to understand, and the calculations used to carry out backpropagation can be quite complex. This article will endeavor to give you an intuitive understanding of backpropagation, using little in the way of complex math. However, some discussion of the math behind backpropagation is necessary.

## The Goal of Backprop

Let’s start by defining the goal of backpropagation. The weights of a deep neural network are the strength of connections between units of a neural network. When the neural network is established, assumptions are made about how the units in one layer are connected to the units in the layers joined with it. As the data moves through the neural network, the weights are applied and these assumptions are put to use. When the data reaches the final layer of the network, a prediction is made about how the features are related to the classes in the dataset. The difference between the predicted values and the actual values is the loss/error, and the goal of backpropagation is to reduce the loss. This is accomplished by adjusting the weights of the network, making the assumptions more like the true relationships between the input features.

## Training A Deep Neural Network

Before backpropagation can be done on a neural network, the regular/forward training pass of a neural network must be carried out. When a neural network is created, a set of weights is initialized. The value of the weights will be altered as the network is trained. The forward training pass of a neural network can be conceived of as three discrete steps: neuron activation, neuron transfer, and forward propagation.

When training a deep neural network, we need to make use of multiple mathematical functions. Each neuron in a deep neural network combines its incoming data with an activation function, which determines the neuron’s output. The activation value of a neuron is calculated as a weighted sum of its inputs. The weights and input values depend on the index of the nodes being used to calculate the activation. Another number must be taken into account when calculating the activation value: a bias value. The bias isn’t multiplied by any input; it is simply added to the weighted sum. All of this means that the following equation could be used to calculate the activation value:

Activation = sum(weight * input) + bias

After the neuron is activated, an activation function is used to determine what the actual output of the neuron will be. Different activation functions are optimal for different learning tasks, but commonly used activation functions include the sigmoid function, the Tanh function, and the ReLU function.

Once the outputs of the neuron are calculated by running the activation value through the desired activation function, forward propagation is done. Forward propagation is just taking the outputs of one layer and making them the inputs of the next layer. The new inputs are then used to calculate the new activation values, and the output of this operation is passed on to the following layer. This process continues all the way through to the end of the neural network.
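The three steps (activation, transfer, forward propagation) can be sketched for a tiny network as follows. The network layout (two inputs, two hidden neurons, one output neuron), the hand-picked weights, and the choice of the sigmoid function are all assumptions for illustration; the function names are made up for this example.

```python
import math


def sigmoid(x):
    """One common activation (transfer) function; squashes any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_activation(weights, bias, inputs):
    """Activation = sum(weight * input) + bias."""
    return sum(w * i for w, i in zip(weights, inputs)) + bias


def forward_propagate(network, inputs):
    """Feed the outputs of each layer in as the inputs of the next layer."""
    for layer in network:
        inputs = [
            sigmoid(neuron_activation(neuron["weights"], neuron["bias"], inputs))
            for neuron in layer
        ]
    return inputs


# Hypothetical network: a hidden layer with two neurons, then an output layer with one.
network = [
    [{"weights": [1.0, -1.0], "bias": 0.0}, {"weights": [0.5, 0.5], "bias": -1.0}],
    [{"weights": [1.0, 1.0], "bias": 0.0}],
]

outputs = forward_propagate(network, [1.0, 1.0])
print(outputs)
```

With these particular weights both hidden neurons have an activation of 0 and therefore output 0.5, so the output neuron receives an activation of 1.0 and passes it through the sigmoid.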

## Backpropagation

The process of backpropagation takes in the final decisions of a model’s training pass, and then it determines the errors in these decisions. The errors are calculated by contrasting the outputs/decisions of the network and the expected/desired outputs of the network.

Once the errors in the network’s decisions have been calculated, this information is backpropagated through the network and the parameters of the network are altered along the way. The method that is used to update the weights of the network is based in calculus; specifically, it’s based on the chain rule. However, an understanding of calculus isn’t necessary to understand the idea behind backpropagation. Just know that when an output value is provided from a neuron, the slope of the output value is calculated with a transfer function, producing a derived output. When doing backpropagation, the error for a specific neuron is calculated according to the following formula:

error = (expected_output – actual_output) * slope of neuron’s output value

When operating on the neurons in the output layer, the class value is used as the expected value. After the error has been calculated, the error is used as the input for the neurons in the hidden layer, meaning that the error for this hidden layer is the weighted errors of the neurons found within the output layer. The error calculations travel backward through the network along the network’s weights.

After the errors for the network have been calculated, the weights in the network must be updated. As mentioned, calculating the error involves determining the slope of the output value. After the slope has been calculated, a process known as gradient descent can be used to adjust the weights in the network. A gradient is a slope, whose angle/steepness can be measured. Slope is calculated as “y over x”, or the “rise” over the “run”. In the case of the neural network and the error rate, the “y” is the calculated error, while the “x” is the network’s parameters. The network’s parameters have a relationship to the calculated error values, and as the network’s weights are adjusted the error increases or decreases.

“Gradient descent” is the process of updating the weights so that the error rate decreases. Backpropagation is used to calculate the relationship between the neural network’s parameters and the error rate, which sets up the network for gradient descent. Training a network with gradient descent involves calculating the network’s outputs through forward propagation, backpropagating the error, and then updating the weights of the network.
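The whole cycle of forward propagation, backpropagating the error, and updating the weights can be sketched on a very small network. This is a minimal illustration under several assumptions: sigmoid neurons (whose slope is output * (1 - output)), a single training example, hand-picked starting weights, and a learning rate of 0.5. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, inputs):
    """Forward pass that also stores each neuron's output for later use."""
    for layer in network:
        outputs = []
        for neuron in layer:
            activation = sum(w * i for w, i in zip(neuron["weights"], inputs)) + neuron["bias"]
            neuron["output"] = sigmoid(activation)
            outputs.append(neuron["output"])
        inputs = outputs
    return inputs


def backward(network, expected):
    """Compute error * slope for every neuron, working from the output layer back."""
    for layer_index in reversed(range(len(network))):
        for j, neuron in enumerate(network[layer_index]):
            if layer_index == len(network) - 1:
                # Output layer: error = expected output - actual output.
                error = expected[j] - neuron["output"]
            else:
                # Hidden layer: weighted errors of the neurons in the next layer.
                error = sum(
                    nxt["weights"][j] * nxt["delta"] for nxt in network[layer_index + 1]
                )
            slope = neuron["output"] * (1.0 - neuron["output"])  # sigmoid derivative
            neuron["delta"] = error * slope


def update_weights(network, inputs, alpha):
    """Nudge every weight and bias in the direction that reduces the error."""
    for layer in network:
        for neuron in layer:
            neuron["weights"] = [
                w + alpha * neuron["delta"] * i for w, i in zip(neuron["weights"], inputs)
            ]
            neuron["bias"] += alpha * neuron["delta"]
        inputs = [neuron["output"] for neuron in layer]


# Hypothetical network and a single training example.
network = [
    [{"weights": [0.1, -0.2], "bias": 0.0}, {"weights": [0.3, 0.4], "bias": 0.0}],
    [{"weights": [0.2, -0.1], "bias": 0.0}],
]
inputs, expected = [1.0, 0.0], [1.0]

output_before = forward(network, inputs)[0]
for _ in range(200):
    forward(network, inputs)
    backward(network, expected)
    update_weights(network, inputs, alpha=0.5)
output_after = forward(network, inputs)[0]
print(output_before, output_after)
```

Because the error is defined as expected minus actual, the update adds alpha * delta * input to each weight. After repeated passes the network’s output for this example moves from roughly 0.5 towards the expected value of 1.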


# What is Meta-Learning?

One of the fastest-growing areas of research in machine learning is the area of meta-learning. Meta-learning, in the machine learning context, is the use of machine learning algorithms to assist in the training and optimization of other machine learning models. As meta-learning is becoming more and more popular and more meta-learning techniques are being developed, it’s beneficial to have an understanding of what meta-learning is and to have a sense of the various ways it can be applied. Let’s examine the ideas behind meta-learning, types of meta-learning, as well as some of the ways meta-learning can be used.

## Defining Meta-Learning

The term meta-learning was coined by Donald Maudsley to describe a process by which people begin to shape what they learn, becoming “increasingly in control of habits of perception, inquiry, learning, and growth that they have internalized”. Later, cognitive scientists and psychologists would describe meta-learning as “learning how to learn”.

For the machine learning version of meta-learning, the general idea of “learning how to learn” is applied to AI systems. In the AI sense, meta-learning is the ability of an artificially intelligent machine to learn how to carry out various complex tasks, taking the principles it used to learn one task and applying it to other tasks. AI systems typically have to be trained to accomplish a task through the mastering of many small subtasks. This training can take a long time and AI agents don’t easily transfer the knowledge learned during one task to another task. Creating meta-learning models and techniques can help AI learn to generalize learning methods and acquire new skills quicker.

## Types of Meta-Learning

**Optimizer Meta-Learning**

Meta-learning is often employed to optimize the performance of an already existing neural network. Optimizer meta-learning methods typically function by tweaking the hyperparameters of a different neural network in order to improve the performance of the base neural network. The result is that the target network should become better at performing the task it is being trained on. One example of a meta-learning optimizer is the use of a network to improve gradient descent results.

**Few-Shots Meta-Learning**

A few-shots meta-learning approach is one where a deep neural network is engineered which is capable of generalizing from the training datasets to unseen datasets. An instance of few-shot classification is similar to a normal classification task, but instead, the data samples are entire datasets. The model is trained on many different learning tasks/datasets and then it’s optimized for peak performance on the multitude of training tasks and unseen data. In this approach, a single training sample is split up into multiple classes. This means that each training sample/dataset could potentially be made up of two classes, for a total of 4-shots. In this case, the total training task could be described as a 4-shot 2-class classification task.

In few-shot learning, the idea is that the individual training samples are minimalistic and that the network can learn to identify objects after having seen just a few pictures. This is much like how a child learns to distinguish objects after seeing just a couple of pictures. This approach has been used to create techniques like one-shot generative models and memory augmented neural networks.

**Metric Meta-Learning**

Metric based meta-learning is the utilization of neural networks to determine if a metric is being used effectively and if the network or networks are hitting the target metric. Metric meta-learning is similar to few-shot learning in that just a few examples are used to train the network and have it learn the metric space. The same metric is used across the diverse domain and if the networks diverge from the metric they are considered to be failing.

**Recurrent Model Meta-Learning**

Recurrent model meta-learning is the application of meta-learning techniques to Recurrent Neural Networks and the similar Long Short-Term Memory networks. This technique operates by training the RNN/LSTM model to sequentially learn a dataset and then using this trained model as a basis for another learner. The meta-learner takes on board the specific optimization algorithm that was used to train the initial model. The inherited parameterization of the meta-learner enables it to quickly initialize and converge, but still be able to update for new scenarios.

## How Does Meta-Learning Work?

The exact way that meta-learning is conducted varies depending on the model and the nature of the task at hand. However, in general, a meta-learning task involves copying over the parameters of the first network into the parameters of the second network/the optimizer.

There are two training processes in meta-learning. The meta-learning model is typically trained after several steps of training on the base model have been carried out. After the forward, backward, and optimization steps that train the base model, the forward training pass is carried out for the optimization model. For example, after three or four steps of training on the base model, a meta-loss is computed. After the meta-loss is computed, the gradients are computed for each meta-parameter. After this occurs, the meta-parameters in the optimizer are updated.

One possibility for calculating the meta-loss is to finish the forward training pass of the initial model and then combine the losses that have already been computed. The meta-optimizer could even be another meta-learner, though at a certain point a discrete optimizer like ADAM or SGD must be used.

Many deep learning models can have hundreds of thousands or even millions of parameters. Creating a meta-learner that has an entirely new set of parameters would be computationally expensive, and for this reason, a tactic called coordinate-sharing is typically used. Coordinate-sharing involves engineering the meta-learner/optimizer so that it learns a single parameter from the base model and then just clones that parameter in place of all of the other parameters. The result is that the parameters the optimizer possesses don’t depend on the parameters of the model.