When you train a neural network, you have to avoid overfitting. Overfitting is a problem in machine learning and statistics where a model learns the patterns of a training dataset too well, explaining the training data almost perfectly but failing to generalize its predictive power to other sets of data. Put another way, an overfit model will often show extremely high accuracy on the training dataset but low accuracy on data collected and run through the model in the future. That’s a quick definition of overfitting, so let’s take a more detailed look at how overfitting occurs and how it can be avoided.
Understanding “Fit” and Overfitting
Before we delve too deeply into overfitting, it might be helpful to take a look at the concept of underfitting and “fit” generally. When we train a model we are trying to develop a framework that is capable of predicting the nature, or class, of items within a dataset, based on the features that describe those items. A model should be able to explain a pattern within a dataset and predict the classes of future data points based off of this pattern. The better the model explains the relationship between the features of the training set, the more “fit” our model is.
A model that poorly explains the relationship between the features of the training data and thus fails to accurately classify future data examples is underfitting the training data. If you were to graph the predicted relationship of an underfitting model against the actual intersection of the features and labels, the predictions would veer off the mark. If we had a graph with the actual values of a training set labeled, a severely underfitting model would drastically miss most of the data points. A model with a better fit might cut a path through the center of the data points, with individual data points being off of the predicted values by only a little.
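Underfitting can often occur when there is insufficient data to create an accurate model, or when the model is too simple for the data, for example when a linear model is fit to non-linear data. More informative features, a more flexible model, or (when the dataset is too small to reveal the pattern) more training data will often help reduce underfitting.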
So why wouldn’t we just create a model that explains every point in the training data perfectly? Surely perfect accuracy is desirable? Creating a model that has learned the patterns of the training data too well is what causes overfitting. The training data set and other, future datasets you run through the model will not be exactly the same. They will likely be very similar in many respects, but they will also differ in key ways. Therefore, designing a model that explains the training dataset perfectly means you end up with a theory about the relationship between features that doesn’t generalize well to other datasets.
Overfitting occurs when a model learns the details within the training dataset too well, causing the model to suffer when predictions are made on outside data. This may occur when the model learns not only the genuine patterns in the dataset but also its random fluctuations or noise, placing importance on these random/unimportant occurrences.
Overfitting is more likely to occur when nonlinear models are used, as they are more flexible when learning data features. Nonparametric machine learning algorithms often come with settings and techniques that can be applied to constrain the model’s sensitivity to data and thereby reduce overfitting. As an example, decision tree models are highly prone to overfitting, but a technique called pruning can be used to remove branches that contribute little to predictive performance, discarding some of the detail that the model has learned.
If you were to graph out the predictions of the model on X and Y axes, you would have a line of prediction that zigzags back and forth, which reflects the fact that the model has tried too hard to fit all the points in the dataset into its explanation.
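The contrast between underfitting and overfitting described above can be sketched with polynomial fits of different degrees. This is a minimal illustration, not a benchmark: the sine-shaped ground truth, the noise level, the sample sizes, and the names make_data and mse are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a smooth curve plus random noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the given points.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(10)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

The degree-1 line cannot follow the curve and tends to have high error on both sets (underfitting). The degree-9 polynomial can pass through all ten training points, so its training error is close to zero while its error on the fresh points is typically much larger (overfitting).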
Controlling Overfitting And Getting A Good Fit
When we train a model, we ideally want the model to make no errors. When the model’s performance converges towards making correct predictions on all the data points in the training dataset, the fit is becoming better. A model with a good fit is able to explain almost all of the training dataset without overfitting.
As a model trains, its performance improves over time. The model’s error rate will decrease as training proceeds, but only up to a certain point. The point at which the model’s error on the test set begins to rise again is typically the point at which overfitting begins. In order to get the best fit for a model, we want to stop training at the point of lowest loss on this held-out data, before the error starts increasing again. The optimal stopping point can be ascertained by graphing the performance of the model throughout training and stopping when the held-out loss is lowest. However, one risk with this method of controlling for overfitting is that specifying the endpoint for the training based on test performance means that the test data becomes somewhat included in the training procedure, and it loses its status as purely “untouched” data.
There are a couple of different ways that one can combat overfitting. One is to use a validation dataset in addition to the test set, and to track the model’s performance during training against the validation set instead of the test set. This keeps your test dataset unseen. Another is to use a resampling technique to estimate how accurate the model will be on unseen data. A popular resampling method is k-fold cross-validation. This technique divides your data into k subsets (folds). The model is trained on all but one of the folds and evaluated on the fold that was held out, and the process is repeated so that each fold serves once as the held-out set. The results are then combined to estimate how the model will perform on outside data.
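The stopping rule can be sketched as a small function that scans the held-out loss recorded after each epoch. The function name, the patience parameter (how many epochs without improvement to tolerate), and the loss values are assumptions made for illustration.

```python
def best_stopping_epoch(held_out_losses, patience=2):
    """Return (epoch, loss) of the lowest held-out loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(held_out_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # error has been rising for `patience` epochs
    return best_epoch, best_loss

# Made-up losses: they fall, bottom out at epoch 4, then rise again.
losses = [0.90, 0.62, 0.48, 0.41, 0.39, 0.42, 0.47, 0.55]
stop_epoch, stop_loss = best_stopping_epoch(losses)
print(stop_epoch, stop_loss)  # 4 0.39
```

In practice the weights saved at the returned epoch would be the ones kept. Using a small patience value avoids stopping on a single noisy uptick.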
Making use of cross-validation is one of the best ways to estimate a model’s accuracy on unseen data, and when it is combined with a validation dataset, overfitting can often be kept to a minimum.
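To make the fold bookkeeping of k-fold cross-validation concrete, here is a minimal sketch that only produces the index splits. The function name is hypothetical, and the data is assumed to have been shuffled beforehand; a real workflow would train and score a model on each split.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, held_out) index pairs."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, held_out))
        start += size
    return folds

for train, held_out in k_fold_indices(10, 3):
    print("train:", train, "held out:", held_out)
```

Every sample is held out exactly once, so averaging the k held-out scores uses all of the data for evaluation without ever scoring a model on points it was trained on.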
What is Linear Regression?
Linear regression is an algorithm used to predict, or visualize, a relationship between two different features/variables. In linear regression tasks, there are two kinds of variables being examined: the dependent variable and the independent variable. The independent variable is the variable that stands by itself, not impacted by the other variable. As the independent variable is adjusted, the levels of the dependent variable will fluctuate. The dependent variable is the variable that is being studied, and it is what the regression model solves for/attempts to predict. In linear regression tasks, every observation/instance is comprised of both the dependent variable value and the independent variable value.
That was a quick explanation of linear regression, but let’s make sure we come to a better understanding of linear regression by looking at an example of it and examining the formula that it uses.
Understanding Linear Regression
Assume that we have a dataset covering hard-drive sizes and the cost of those hard drives.
Let’s suppose that the dataset we have is comprised of two different features: the amount of memory (storage capacity) and cost. The more memory we purchase for a computer, the more the cost of the purchase goes up. If we plotted out the individual data points on a scatter plot, we would see the points trending upward from left to right.
The exact memory-to-cost ratio might vary between manufacturers and models of hard drive, but in general, the trend of the data is one that starts in the bottom left (where hard drives are both cheaper and have smaller capacity) and moves to the upper right (where the drives are more expensive and have higher capacity).
If we had the amount of memory on the X-axis and the cost on the Y-axis, a line capturing the relationship between the X and Y variables would start in the lower-left corner and run to the upper right.
The function of a regression model is to determine a linear function between the X and Y variables that best describes the relationship between the two variables. In linear regression, it’s assumed that Y can be calculated from some combination of the input variables. The relationship between the input variables (X) and the target variables (Y) can be portrayed by drawing a line through the points in the graph. The line represents the function that best describes the relationship between X and Y (for example, for every time X increases by 3, Y increases by 2). The goal is to find an optimal “regression line”, or the line/function that best fits the data.
Lines are typically represented by the equation: Y = m*X + b. X refers to the independent variable while Y is the dependent variable. Meanwhile, m is the slope of the line, as defined by the “rise” over the “run”, and b is the intercept, the value of Y when X is zero. Machine learning practitioners represent the famous slope-intercept equation a little differently, using this equation instead:
y(x) = w0 + w1 * x
In the above equation, y is the target variable, w0 and w1 are the model’s parameters (the intercept and the slope, respectively), and the input is “x”. So the equation is read as: “The function that gives Y, depending on X, is equal to the intercept plus the weight multiplied by the feature”. The parameters of the model are adjusted during training to get the best-fit regression line.
The process described above applies to simple linear regression, or regression on datasets where there is only a single feature/independent variable. However, a regression can also be done with multiple features. In the case of “multiple linear regression”, the equation is extended by the number of variables found within the dataset. In other words, while the equation for regular linear regression is y(x) = w0 + w1 * x, the equation for multiple linear regression would be y(x) = w0 + w1x1 plus the weights and inputs for the various features. If we represent the total number of weights and features as w(n)x(n), then we could represent the formula like this:
y(x) = w0 + w1x1 + w2x2 + … + w(n)x(n)
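The formula translates directly into code. In this sketch the function name and the example numbers are made up; weights[0] plays the role of w0 and the remaining weights pair up with the features.

```python
def predict(weights, features):
    """Evaluate y(x) = w0 + w1*x1 + ... + wn*xn.

    weights[0] is the intercept w0; weights[1:] line up with features.
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected one more weight than features")
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Hypothetical model with two features: w0 = 10, w1 = 0.5, w2 = 2.
print(predict([10.0, 0.5, 2.0], [4.0, 3.0]))  # 10 + 0.5*4 + 2*3 = 18.0
```

With a single feature the same function reduces to the simple linear regression equation y(x) = w0 + w1 * x.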
After establishing the formula for linear regression, the machine learning model will try different values for the weights, drawing different lines of fit. Remember that the goal is to determine which of the possible weight combinations (and therefore which possible line) best fits the data and explains the relationship between the variables.
A cost function is used to measure how close the assumed Y values are to the actual Y values when given a particular weight value. The cost function for linear regression is mean squared error, which just takes the average (squared) error between the predicted value and the true value for all of the various data points in the dataset. The cost function is used to calculate a cost, which captures the difference between the predicted target value and the true target value. If the fit line is far from the data points, the cost will be higher, while the cost will become smaller the closer the line gets to capturing the true relationships between variables. The weights of the model are then adjusted until the weight configuration that produces the smallest amount of error is found.
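As a small worked example of the cost function, the sketch below scores two candidate lines with mean squared error and then asks NumPy for the least-squares weights. The data points are invented so that they lie exactly on y = 1 + 2x, and solving with np.linalg.lstsq is one convenient way to find the minimum, not the only one.

```python
import numpy as np

def mean_squared_error(w0, w1, x, y):
    # Average squared gap between predicted and true target values.
    predictions = w0 + w1 * x
    return float(np.mean((predictions - y) ** 2))

# Made-up data lying exactly on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

poor_fit = mean_squared_error(0.0, 1.0, x, y)  # line far from the points
good_fit = mean_squared_error(1.0, 2.0, x, y)  # line through the points
print(poor_fit, good_fit)  # 13.5 0.0

# Least-squares solution: a column of ones lets w0 act as the intercept.
design = np.column_stack([np.ones_like(x), x])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
w0, w1 = solution
print(w0, w1)
```

The poorly placed line has a much higher cost than the line that passes through the points, and the least-squares weights come out at approximately w0 = 1 and w1 = 2.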
What are Support Vector Machines?
Support vector machines are a type of machine learning classifier, arguably one of the most popular kinds of classifiers. Support vector machines are especially useful for numerical prediction, classification, and pattern recognition tasks.
Support vector machines operate by drawing decision boundaries between data points, aiming for the decision boundary that best separates the data points into classes (or is the most generalizable). The goal when using a support vector machine is for the margin around the decision boundary to be as large as possible, so that the distance between the boundary line and the data points nearest to it is maximized. That’s a quick explanation of how support vector machines (SVMs) operate, but let’s take some time to delve deeper into how SVMs operate and understand the logic behind their operation.
The Goal Of Support Vector Machines
Imagine a graph with a number of data points on it, based on features specified by the X and Y axes. The data points on the graph can loosely be divided up into two different clusters, and the cluster that a data point belongs to indicates the class of the data point. Now assume that we want to draw a line down the graph that separates the two classes from each other, with all the data points in one class found on one side of the line and all the data points belonging to another class found on the other side of the line. This separating line is known as a hyperplane.
You can think of a support vector machine as creating “roads” throughout a city, separating the city into districts on either side of the road. All the buildings (data points) that are found on one side of the road belong to one district.
The goal of a support vector machine is not only to draw hyperplanes and divide data points, but to draw the hyperplane that separates data points with the largest margin, or with the most space between the dividing line and the data points closest to it. Returning to the “roads” metaphor, if a city planner draws plans for a freeway, they don’t want the freeway to be too close to houses or other buildings. The more margin between the freeway and the buildings on either side, the better. The larger this margin, the more “confident” the classifier can be about its predictions. In the case of binary classification, drawing the correct hyperplane means choosing a hyperplane that is just in the middle of the two different classes. If the decision boundary/hyperplane is farther from one class, it will be closer to another. Therefore, the hyperplane must balance the margin between the two different classes.
Calculating The Separating Hyperplane From Support Vectors
So how does a support vector machine determine the best separating hyperplane/decision boundary? This is accomplished by calculating possible hyperplanes using a mathematical formula. We won’t cover the formula for calculating hyperplanes in extreme detail, but the line is calculated with the famous slope/line formula:
Y = ax + b
Meanwhile, lines are made out of points, which means any hyperplane can be described as a set of points: those for which the weights of the model times the features, shifted by a specified offset/bias (“b”), equals zero. Changing the offset moves the hyperplane parallel to itself.
SVMs draw many hyperplanes. For example, the boundary line is one hyperplane, but the datapoints that the classifier considers are also on hyperplanes. The values for x are determined based on the features in the dataset. For instance, if you had a dataset with the heights and weights of many people, the “height” and “weight” features would be the features used to calculate the “X”. The margins between the proposed hyperplane and the various “support vectors” (the datapoints closest to the dividing hyperplane on either side) are calculated from the following expression, whose value, divided by the length of W, gives the distance from a point X to the hyperplane:
W * X - b
While you can read more about the math behind SVMs, if you are looking for a more intuitive understanding of them just know that the goal is to maximize the distance between the proposed separating hyperplane/boundary line and the other hyperplanes that run parallel to it (and on which the data points are found).
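The margin idea can be sketched in a few lines. The helper names and the sample points are hypothetical; the score W * X - b is divided by the length of W so that the result is a geometric distance rather than an unscaled score.

```python
import math

def signed_distance(w, b, x):
    """Distance from point x to the hyperplane w . x - b = 0.
    The sign tells which side of the hyperplane the point is on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score / norm

def margin(w, b, points):
    # The margin is set by the point closest to the hyperplane.
    return min(abs(signed_distance(w, b, p)) for p in points)

w, b = [3.0, 4.0], 0.0  # hypothetical hyperplane; length of w is 5
points = [[3.0, 4.0], [-3.0, -4.0], [6.0, 8.0]]

print([signed_distance(w, b, p) for p in points])  # [5.0, -5.0, 10.0]
print(margin(w, b, points))  # 5.0
```

Training an SVM amounts to searching for the w and b that make this smallest distance as large as possible while keeping each class on its own side.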
The process described so far applies to binary classification tasks. However, SVM classifiers can also be used for non-binary classification tasks. When doing SVM classification on a dataset with three or more classes, more boundary lines are used. For example, if a classification task has three classes instead of two, two dividing lines can be used to divide up data points into classes, and the region that comprises a single class will fall in between two dividing lines instead of one. Instead of calculating the distance between just two classes and a single decision boundary, the classifier must now consider the margins between the decision boundaries and the multiple classes within the dataset. In practice, this is commonly done by combining several binary classifiers, for example one per class or one per pair of classes.
The process described above applies to cases where the data is linearly separable. Note that, in reality, datasets are almost never completely linearly separable, which means that when using an SVM classifier you will often need to use two different techniques: soft margin and kernel tricks. Consider a situation where data points of different classes are mixed together, with some instances belonging to one class in the “cluster” of another class. How could you have the classifier handle these instances?
One tactic that can be used to handle non-linearly separable datasets is the application of a “soft margin” SVM classifier. A soft margin classifier operates by accepting a few misclassified data points. It will try to draw a line that best separates the clusters of data points from each other, as they contain the majority of the instances belonging to their respective classes. The soft margin SVM classifier attempts to create a dividing line that balances the two demands of the classifier: accuracy and margin. It will try to minimize the misclassification while also maximizing the margin.
The SVM’s tolerance for error can be adjusted through manipulation of a hyperparameter called “C”. The C value controls how many support vectors the classifier considers when drawing decision boundaries. The C value is a penalty applied to misclassifications, meaning that the larger the C value, the fewer support vectors the classifier takes into account and the narrower the margin. A smaller C value tolerates more misclassified points in exchange for a wider margin.
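One common way to write down this trade-off is the soft margin objective: a margin term plus C times the total hinge loss. The sketch below assumes that standard formulation, with labels coded as +1 and -1; the function name and data are invented for illustration.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels in {+1, -1}.

    The first term rewards a wide margin; the second penalizes points
    that fall inside the margin or on the wrong side of the boundary.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) - b
        hinge_total += max(0.0, 1.0 - y * score)
    return margin_term + C * hinge_total

w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [-2.0, 0.0], [-0.5, 0.0]]
labels = [1, -1, 1]  # the third point sits on the wrong side

print(soft_margin_objective(w, b, points, labels, C=1.0))   # 2.0
print(soft_margin_objective(w, b, points, labels, C=10.0))  # 15.5
```

The same misclassified point costs far more when C is large, which pushes the optimizer toward boundaries that classify the training points correctly even if the margin has to shrink.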
The Kernel Trick operates by applying nonlinear transformations to the features in the dataset. The Kernel Trick takes the existing features in the dataset and, in effect, creates new features through the application of nonlinear mathematical functions. In practice the new features are usually never computed explicitly; the kernel function directly returns the similarity that two points would have in the transformed space. What results from the application of these nonlinear transformations is a nonlinear decision boundary. Because the SVM classifier is no longer restricted to drawing linear decision boundaries, it can start drawing curved decision boundaries that better encapsulate the true distribution of the support vectors and bring misclassifications to a minimum. Two of the most popular SVM nonlinear kernels are Radial Basis Function and Polynomial. The polynomial function creates polynomial combinations of all the existing features, while the Radial Basis Function generates new features by measuring the distance from a central point (or points) to all the other points.
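The two kernels can be sketched as plain functions. The parameter names gamma, degree, and coef0 follow common convention but are assumptions here, and the explicit feature map is included only to show that a degree-2 polynomial kernel gives the same number as building the new features by hand.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=0.0):
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # Similarity decays with the squared distance between the points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def explicit_degree2_features(x):
    # Hand-built features matching polynomial_kernel(degree=2, coef0=0)
    # for two-dimensional inputs.
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 4.0]
via_kernel = polynomial_kernel(x, z)
via_features = dot(explicit_degree2_features(x), explicit_degree2_features(z))
print(via_kernel, via_features)  # both approximately 121
print(rbf_kernel(x, x))          # 1.0: a point is maximally similar to itself
```

The kernel reaches the same value with a single dot product in the original space, which is why the transformed features rarely need to be materialized.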
What is Gradient Descent?
If you’ve read about how neural networks are trained, you’ve almost certainly come across the term “gradient descent” before. Gradient descent is the primary method of optimizing a neural network’s performance, reducing the network’s loss/error rate. However, gradient descent can be a little hard to understand for those new to machine learning, and this article will endeavor to give you a decent intuition for how gradient descent operates.
Gradient descent is an optimization algorithm. It’s used to improve the performance of a neural network by making tweaks to the parameters of the network such that the difference between the network’s predictions and the actual/expected values (referred to as the loss) is as small as possible. Gradient descent takes the initial values of the parameters and uses operations based in calculus to adjust their values towards the values that will make the network as accurate as it can be. You don’t need to know a lot of calculus to understand how gradient descent works, but you do need to have an understanding of gradients.
What Are Gradients?
Assume that there is a graph that represents the amount of error a neural network makes. The bottom of the graph represents the points of lowest error while the top of the graph is where the error is the highest. We want to move from the top of the graph down to the bottom. A gradient is just a way of quantifying the relationship between error and the weights of the neural network. The relationship between these two things can be graphed as a slope, with incorrect weights producing more error. The steepness of the slope/gradient represents how fast the model is learning.
A steeper slope means large reductions in error are being made and the model is learning fast, whereas if the slope is zero the model is on a plateau and isn’t learning. We can move down the slope towards less error by calculating a gradient, a direction of movement (change in the parameters of the network) for our model.
Let’s shift the metaphor just slightly and imagine a series of hills and valleys. We want to get to the bottom of the hill and find the part of the valley that represents the lowest loss. When we start at the top of the hill we can take large steps down the hill and be confident that we are heading towards the lowest point in the valley.
However, as we get closer to the lowest point in the valley, our steps will need to become smaller, or else we could overshoot the true lowest point. Similarly, it’s possible that when adjusting the weights of the network, the adjustments can actually take it further away from the point of lowest loss, and therefore the adjustments must get smaller over time. In the context of descending a hill towards a point of lowest loss, the gradient is a vector/instructions detailing the path we should take and how large our steps should be.
Now that we know gradients are instructions that tell us which direction to move in (which coefficients should be updated) and how large our steps should be (how much the coefficients should be updated), we can explore how the gradient is calculated.
Calculating Gradients and Gradient Descent Procedure
In order to carry out gradient descent, the gradients must first be calculated. In order to calculate the gradient, we need to know the loss/cost function. We’ll use the cost function to determine the derivative. In calculus, the derivative just refers to the slope of a function at a given point, so we’re basically just calculating the slope of the hill based on the loss function. We determine the loss by running the coefficients through the loss function. If we represent the loss function as “f”, then we can state that the equation for calculating the loss is as follows (we’re just running the coefficients through our chosen cost function):
Loss = f(coefficient)
We then calculate the derivative, or determine the slope. Getting the derivative of the loss will tell us which direction is up or down the slope, by giving us the appropriate sign to adjust our coefficients by. We’ll represent the appropriate direction as “delta”.
delta = derivative_function(loss)
We’ve now determined which direction is downhill towards the point of lowest loss. This means we can update the coefficients in the neural network parameters and hopefully reduce the loss. We’ll update the coefficients based on the previous coefficients minus the appropriate change in value as determined by the direction (delta) and an argument that controls the magnitude of change (the size of our step). The argument that controls the size of the update is called the “learning rate” and we’ll represent it as “alpha”.
coefficient = coefficient - (alpha * delta)
We then just repeat this process until the network has converged around the point of lowest loss, which is ideally close to zero.
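Put together, the procedure looks like the sketch below. A one-parameter quadratic loss stands in for a real network’s loss function; that choice, the starting value, and the function names are assumptions made so the example stays small.

```python
def loss(coefficient):
    # Hypothetical loss with its lowest point at coefficient = 3.
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    # Slope of the loss at the current coefficient.
    return 2.0 * (coefficient - 3.0)

def gradient_descent(start, alpha=0.1, steps=100):
    coefficient = start
    for _ in range(steps):
        delta = derivative(coefficient)                # direction and steepness
        coefficient = coefficient - (alpha * delta)    # step downhill
    return coefficient

final = gradient_descent(start=0.0)
print(final, loss(final))
```

Each step moves the coefficient a fraction of the way towards 3, so after 100 steps it sits very close to the point of lowest loss.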
It’s very important to choose the right value for the learning rate (alpha). The chosen learning rate must be neither too small nor too large. Remember that as we approach the point of lowest loss our steps must become smaller, or else we will overshoot the true point of lowest loss and end up on the other side. The region of lowest loss is narrow, and if our rate of change is too large the error can end up increasing again. If the step sizes are too large, the network’s performance will continue to bounce around the point of lowest loss, overshooting it on one side and then the other. If this happens the network will never converge on the true optimal weight configuration.
In contrast, if the learning rate is too small the network can potentially take an extraordinarily long time to converge on the optimal weights.
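The effect of the learning rate is easy to see on a toy loss. This sketch assumes the loss w^2 (lowest point at zero, derivative 2w) and three arbitrary learning rates picked to show the three behaviours; none of the values come from a real network.

```python
def run_descent(alpha, steps=50, start=1.0):
    """Gradient descent on the toy loss w**2, whose derivative is 2*w."""
    w = start
    for _ in range(steps):
        w = w - alpha * (2.0 * w)
    return w

too_small = run_descent(alpha=0.001)  # barely moves from the start
reasonable = run_descent(alpha=0.1)   # settles close to the minimum at 0
too_large = run_descent(alpha=1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the smallest rate the weight has covered less than a tenth of the distance after 50 steps, with the middle rate it is essentially at the minimum, and with the largest rate each update lands further away on the opposite side, so the loss grows instead of shrinking.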
Types Of Gradient Descent
Now that we understand how gradient descent works in general, let’s take a look at some of the different types of gradient descent.
Batch Gradient Descent: This form of gradient descent runs through all the training samples before updating the coefficients. This type of gradient descent is likely to be the most computationally efficient form of gradient descent, as the weights are only updated once the entire batch has been processed, meaning there are fewer updates total. However, if the dataset contains a large number of training examples, then batch gradient descent can make training take a long time.
Stochastic Gradient Descent: In Stochastic Gradient Descent only a single training example is processed for every iteration of gradient descent and parameter updating. This occurs for every training example. Because only one training example is processed before the parameters are updated, it tends to converge faster than Batch Gradient Descent, as updates are made sooner. However, because the process must be carried out on every item in the training set, it can take quite a long time to complete if the dataset is large, and so use of one of the other gradient descent types is often preferred.
Mini-Batch Gradient Descent: Mini-Batch Gradient Descent operates by splitting the entire training dataset up into subsections. It creates smaller mini-batches that are run through the network, and when the mini-batch has been used to calculate the error the coefficients are updated. Mini-batch Gradient Descent strikes a middle ground between Stochastic Gradient Descent and Batch Gradient Descent. The model is updated more frequently than in the case of Batch Gradient Descent, which means a slightly faster and more robust convergence on the model’s optimal parameters. It’s also more computationally efficient than Stochastic Gradient Descent.
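The three variants differ only in how many examples feed each update, so a single function with a batch_size argument can sketch all of them. The linear regression setup, the noise-free synthetic data, and the hyperparameter values are assumptions chosen so the example runs quickly.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit linear regression weights by minimizing mean squared error.

    batch_size = len(X) gives batch gradient descent, batch_size = 1 gives
    stochastic gradient descent, anything in between gives mini-batch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)
            w -= lr * grad  # one parameter update per batch
    return w

# Synthetic noise-free data on the line y = 1 + 2x.
x = np.linspace(-1.0, 1.0, 64)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

fitted = {name: gradient_descent(X, y, size)
          for name, size in [("batch", 64), ("stochastic", 1), ("mini-batch", 16)]}
for name, w in fitted.items():
    print(name, w)
```

On this easy problem all three variants recover weights close to (1, 2). They differ in the number of updates per pass over the data: 1 for batch, 64 for stochastic, and 4 for mini-batches of 16.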