One of the fastest-growing areas of research in machine learning is the area of meta-learning. Meta-learning, in the machine learning context, is the use of machine learning algorithms to assist in the training and optimization of other machine learning models. As meta-learning is becoming more and more popular and more meta-learning techniques are being developed, it’s beneficial to have an understanding of what meta-learning is and to have a sense of the various ways it can be applied. Let’s examine the ideas behind meta-learning, types of meta-learning, as well as some of the ways meta-learning can be used.
The term meta-learning was coined by Donald Maudsley to describe a process by which people begin to shape what they learn, becoming “increasingly in control of habits of perception, inquiry, learning, and growth that they have internalized”. Later, cognitive scientists and psychologists would describe meta-learning as “learning how to learn”.
For the machine learning version of meta-learning, the general idea of “learning how to learn” is applied to AI systems. In the AI sense, meta-learning is the ability of an artificially intelligent machine to learn how to carry out various complex tasks, taking the principles it used to learn one task and applying it to other tasks. AI systems typically have to be trained to accomplish a task through the mastering of many small subtasks. This training can take a long time and AI agents don’t easily transfer the knowledge learned during one task to another task. Creating meta-learning models and techniques can help AI learn to generalize learning methods and acquire new skills quicker.
Types of Meta-Learning
Meta-learning is often employed to optimize the performance of an already existing neural network. Optimizer meta-learning methods typically function by tweaking the hyperparameters of a different neural network in order to improve the performance of the base neural network. The result is that the target network should become better at performing the task it is being trained on. One example of a meta-learning optimizer is the use of a network to improve gradient descent results.
A few-shots meta-learning approach is one where a deep neural network is engineered which is capable of generalizing from the training datasets to unseen datasets. An instance of few-shot classification is similar to a normal classification task, but instead, the data samples are entire datasets. The model is trained on many different learning tasks/datasets and then it’s optimized for peak performance on the multitude of training tasks and unseen data. In this approach, a single training sample is split up into multiple classes. This means that each training sample/dataset could potentially be made up of two classes, for a total of 4-shots. In this case, the total training task could be described as a 4-shot 2-class classification task.
In few-shot learning, the idea is that the individual training samples are minimalistic and that the network can learn to identify objects after having seen just a few pictures. This is much like how a child learns to distinguish objects after seeing just a couple of pictures. This approach has been used to create techniques like one-shot generative models and memory augmented neural networks.
Metric based meta-learning is the utilization of neural networks to determine if a metric is being used effectively and if the network or networks are hitting the target metric. Metric meta-learning is similar to few-shot learning in that just a few examples are used to train the network and have it learn the metric space. The same metric is used across the diverse domain and if the networks diverge from the metric they are considered to be failing.
Recurrent Model Meta-Learning
Recurrent model meta-learning is the application of meta-learning techniques to Recurrent Neural Networks and the similar Long Short-Term Memory networks. This technique operates by training the RNN/LSTM model to sequentially learn a dataset and then using this trained model as a basis for another learner. The meta-learner takes on board the specific optimization algorithm that was used to train the initial model. The inherited parameterization of the meta-learner enables it to quickly initialize and converge, but still be able to update for new scenarios.
How Does Meta-Learning Work?
The exact way that meta-learning is conducted varies depending on the model and the nature of the task at hand. However, in general, a meta-learning task involves copying over the parameters of the first network into the parameters of the second network/the optimizer.
There are two training processes in meta-learning. The meta-learning model is typically trained after several steps of training on the base model have been carried out. After the forward, backward, and optimization steps that train the base model, the forward training pass is carried out for the optimization model. For example, after three or four steps of training on the base model, a meta-loss is computed. After the meta-loss is computed, the gradients are computed for each meta-parameter. After this occurs, the meta-parameters in the optimizer are updated.
One possibility for calculating the meta-loss is to finish the forward training pass of the initial model and then combine the losses that have already been computed. The meta-optimizer could even be another meta-learner, though at a certain point a discrete optimizer like ADAM or SGD must be used.
Many deep learning models can have hundreds of thousands or even millions of parameters. Creating a meta-learner that has an entirely new set of parameters would be computationally expensive, and for this reason, a tactic called coordinate-sharing is typically used. Coordinate-sharing involves engineering the meta-learner/optimizer so that it learns a single parameter from the base model and then just clones that parameter in place of all of the other parameters. The result is that the parameters the optimizer possesses don’t depend on the parameters of the model.
What Is K-Nearest Neighbors?
K-Nearest Neighbors is a machine learning technique and algorithm that can be used for both regression and classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a target data point, in order to make a prediction about the class that the data point falls into. K-Nearest Neighbors (KNN) is a conceptually simple yet very powerful algorithm, and for those reasons, it’s one of the most popular machine learning algorithms. Let’s take a deep dive into the KNN algorithm and see exactly how it works. Having a good understanding of how KNN operates will let you appreciated the best and worst use cases for KNN.
An Overview Of KNN
Let’s visualize a dataset on a 2D plane. Picture a bunch of data points on a graph, spread out along the graph in small clusters. KNN examines the distribution of the data points and, depending on the arguments given to the model, it separates the data points into groups. These groups are then assigned a label. The primary assumption that a KNN model makes is that data points/instances which exist in close proximity to each other are highly similar, while if a data point is far away from another group it’s dissimilar to those data points.
A KNN model calculates similarity using the distance between two points on a graph. The greater the distance between the points, the less similar they are. There are multiple ways of calculating the distance between points, but the most common distance metric is just Euclidean distance (the distance between two points in a straight line).
KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to them/their classes must be known. There are two other important things to know about KNN. First, KNN is a non-parametric algorithm. This means that no assumptions about the dataset are made when the model is used. Rather, the model is constructed entirely from the provided data. Second, there is no splitting of the dataset into training and test sets when using KNN. KNN makes no generalizations between a training and testing set, so all the training data is also used when the model is asked to make predictions.
How The KNN Algorithm Operates
A KNN algorithm goes through three main phases as it is carried out:
- Setting K to the chosen number of neighbors.
- Calculating the distance between a provided/test example and the dataset examples.
- Sorting the calculated distances.
- Getting the labels of the top K entries.
- Returning a prediction about the test example.
In the first step, K is chosen by the user and it tells the algorithm how many neighbors (how many surrounding data points) should be considered when rendering a judgment about the group the target example belongs to. In the second step, note that the model checks the distance between the target example and every example in the dataset. The distances are then added into a list and sorted. Afterward, the sorted list is checked and the labels for the top K elements are returned. In other words, if K is set to 5, the model checks the labels of the top 5 closest data points to the target data point. When rendering a prediction about the target data point, it matters if the task is a regression or classification task. For a regression task, the mean of the top K labels is used, while the mode of the top K labels is used in the case of classification.
The exact mathematical operations used to carry out KNN differ depending on the chosen distance metric. If you would like to learn more about how the metrics are calculated, you can read about some of the most common distance metrics, such as Euclidean, Manhattan, and Minkowski.
Why The Value Of K Matters
The main limitation when using KNN is that in an improper value of K (the wrong number of neighbors to be considered) might be chosen. If this happen, the predictions that are returned can be off substantially. It’s very important that, when using a KNN algorithm, the proper value for K is chosen. You want to choose a value for K that maximizes the model’s ability to make predictions on unseen data while reducing the number of errors it makes.
Lower values of K mean that the predictions rendered by the KNN are less stable and reliable. To get an intuition of why this is so, consider a case where we have 7 neighbors around a target data point. Let’s assume that the KNN model is working with a K value of 2 (we’re asking it to look at the two closest neighbors to make a prediction). If the vast majority of the neighbors (five out of seven) belong to the Blue class, but the two closest neighbors just happen to be Red, the model will predict that the query example is Red. Despite the model’s guess, in such a scenario Blue would be a better guess.
If this is the case, why not just choose the highest K value we can? This is because telling the model to consider too many neighbors will also reduce accuracy. As the radius that the KNN model considers increases, it will eventually start considering data points that are closer to other groups than they are the target data point and misclassification will start occurring. For example, even if the point that was initially chosen was in one of the red regions above, if K was set too high, the model would reach into the other regions to consider points. When using a KNN model, different values of K are tried to see which value gives the model the best performance.
KNN Pros And Cons
Let’s examine some of the pros and cons of the KNN model.
KNN can be used for both regression and classification tasks, unlike some other supervised learning algorithms.
KNN is highly accurate and simple to use. It’s easy to interpret, understand, and implement.
KNN doesn’t make any assumptions about the data, meaning it can be used for a wide variety of problems.
KNN stores most or all of the data, which means that the model requires a lot of memory and its computationally expensive. Large datasets can also cause predictions to be take a long time.
KNN proves to be very sensitive to the scale of the dataset and it can be thrown off by irrelevant features fairly easily in comparison to other models.
K-Nearest Neighbors is one of the simplest machine learning algorithms. Despite how simple KNN is, in concept, it’s also a powerful algorithm that gives fairly high accuracy on most problems. When you use KNN, be sure to experiment with various values of K in order to find the number that provides the highest accuracy.
What is Linear Regression?
Linear regression is an algorithm used to predict, or visualize, a relationship between two different features/variables. In linear regression tasks, there are two kinds of variables being examined: the dependent variable and the independent variable. The independent variable is the variable that stands by itself, not impacted by the other variable. As the independent variable is adjusted, the levels of the dependent variable will fluctuate. The dependent variable is the variable that is being studied, and it is what the regression model solves for/attempts to predict. In linear regression tasks, every observation/instance is comprised of both the dependent variable value and the independent variable value.
That was a quick explanation of linear regression, but let’s make sure we come to a better understanding of linear regression by looking at an example of it and examining the formula that it uses.
Understanding Linear Regression
Assume that we have a dataset covering hard-drive sizes and the cost of those hard drives.
Let’s suppose that the dataset we have is comprised of two different features: the amount of memory and cost. The more memory we purchase for a computer, the more the cost of the purchase goes up. If we plotted out the individual data points on a scatter plot, we might get a graph that looks something like this:
The exact memory-to-cost ratio might vary between manufacturers and models of hard drive, but in general, the trend of the data is one that starts in the bottom left (where hard drives are both cheaper and have smaller capacity) and moves to the upper right (where the drives are more expensive and have higher capacity).
If we had the amount of memory on the X-axis and the cost on the Y-axis, a line capturing the relationship between the X and Y variables would start in the lower-left corner and run to the upper right.
The function of a regression model is to determine a linear function between the X and Y variables that best describes the relationship between the two variables. In linear regression, it’s assumed that Y can be calculated from some combination of the input variables. The relationship between the input variables (X) and the target variables (Y) can be portrayed by drawing a line through the points in the graph. The line represents the function that best describes the relationship between X and Y (for example, for every time X increases by 3, Y increases by 2). The goal is to find an optimal “regression line”, or the line/function that best fits the data.
Lines are typically represented by the equation: Y = m*X + b. X refers to the dependent variable while Y is the independent variable. Meanwhile, m is the slope of the line, as defined by the “rise” over the “run”. Machine learning practitioners represent the famous slope-line equation a little differently, using this equation instead:
y(x) = w0 + w1 * x
In the above equation, y is the target variable while “w” is the model’s parameters and the input is “x”. So the equation is read as: “The function that gives Y, depending on X, is equal to the parameters of the model multiplied by the features”. The parameters of the model are adjusted during training to get the best-fit regression line.
The process described above applies to simple linear regression, or regression on datasets where there is only a single feature/independent variable. However, a regression can also be done with multiple features. In the case of “multiple linear regression”, the equation is extended by the number of variables found within the dataset. In other words, while the equation for regular linear regression is y(x) = w0 + w1 * x, the equation for multiple linear regression would be y(x) = w0 + w1x1 plus the weights and inputs for the various features. If we represent the total number of weights and features as w(n)x(n), then we could represent the formula like this:
y(x) = w0 + w1x1 + w2x2 + … + w(n)x(n)
After establishing the formula for linear regression, the machine learning model will use different values for the weights, drawing different lines of fit. Remember that the goal is to find the line that best fits the data in order to determine which of the possible weight combinations (and therefore which possible line) best fits the data and explains the relationship between the variables.
A cost function is used to measure how close the assumed Y values are to the actual Y values when given a particular weight value. The cost function for linear regression is mean squared error, which just takes the average (squared) error between the predicted value and the true value for all of the various data points in the dataset. The cost function is used to calculate a cost, which captures the difference between the predicted target value and the true target value. If the fit line is far from the data points, the cost will be higher, while the cost will become smaller the closer the line gets to capturing the true relationships between variables. The weights of the model are then adjusted until the weight configuration that produces the smallest amount of error is found.
What are Support Vector Machines?
Support vector machines are a type of machine learning classifier, arguably one of the most popular kinds of classifiers. Support vector machines are especially useful for numerical prediction, classification, and pattern recognition tasks.
Support vector machines operate by drawing decision boundaries between data points, aiming for the decision boundary that best separates the data points into classes (or is the most generalizable). The goal when using a support vector machine is that the decision boundary between the points is as large as possible so that the distance between any given data point and the boundary line is maximized. That’s a quick explanation of how support vector machines (SVMs) operate, but let’s take some time to delve deeper into how SVMs operate and understand the logic behind their operation.
The Goal Of Support Vector Machines
Imagine a graph with a number of data points on it, based on features specified by the X and Y axes. The data points on the graph can loosely be divided up into two different clusters, and the cluster that a data point belongs to indicates the class of the data point. Now assume that we want to draw a line down the graph that separates the two classes from each other, with all the data points in one class found on one side of the line and all the data points belonging to another class found on the other side of the line. This separating line is known as a hyperplane.
You can think of a support vector machine as creating “roads” throughout a city, separating the city into districts on either side of the road. All the buildings (data points) that are found on one side of the road belong to one district.
The goal of a support vector machine is not only to draw hyperplanes and divide data points, but to draw the hyperplane the separates data points with the largest margin, or with the most space between the dividing line and any given data point. Returning to the “roads” metaphor, if a city planner draws plans for a freeway, they don’t want the freeway to be too close to houses or other buildings. The more margin between the freeway and the buildings on either side, the better. The larger this margin, the more “confident” the classifier can be about its predictions. In the case of binary classification, drawing the correct hyperplane means choosing a hyperplane that is just in the middle of the two different classes. If the decision boundary/hyperplane is farther from one class, it will be closer to another. Therefore, the hyperplane must balance the margin between the two different classes.
Calculating The Separating Hyperplane From Support Vectors
So how does a support vector machine determine the best separating hyperplane/decision boundary? This is accomplished by calculating possible hyperplanes using a mathematical formula. We won’t cover the formula for calculating hyperplanes in extreme detail, but the line is calculated with the famous slope/line formula:
Y = ax + b
Meanwhile, lines are made out of points, which means any hyperplane can be described as: the set of points that run parallel to the proposed hyperplane, as determined by the weights of the model times the set of features modified by a specified offset/bias (“d”).
SVMs draw many hyperplanes. For example, the boundary line is one hyperplane, but the datapoints that the classifier considers are also on hyperplanes. The values for x are determined based on the features in the dataset. For instance, if you had a dataset with the heights and weights of many people, the “height” and “weight” features would be the features used to calculate the “X”. The margins between the proposed hyperplane and the various “support vectors” (datapoints) found on either side of the dividing hyperplane are calculated with the following formula:
W * X – b
While you can read more about the math behind SVMs, if you are looking for a more intuitive understanding of them just know that the goal is to maximize the distance between the proposed separating hyperplane/boundary line and the other hyperplanes that run parallel to it (and on which the data points are found).
The process described so far applies to binary classification tasks. However, SVM classifiers can also be used for non-binary classification tasks. When doing SVM classification on a dataset with three or more classes, more boundary lines are used. For example, if a classification task has three classes instead of two, two dividing lines will be used to divide up data points into classes and the region that comprises a single class will fall in between two dividing lines instead of one. Instead of just calculating the distance between just two classes and a decision boundary, the classifier must consider now the margins between the decision boundaries and the multiple classes within the dataset.
The process described above applies to cases where the data is linearly separable. Note that, in reality, datasets are almost never completely linearly separable, which means that when using an SVM classifier you will often need to use two different techniques: soft margin and kernel tricks. Consider a situation where data points of different classes are mixed together, with some instances belonging to one class in the “cluster” of another class. How could you have the classifier handle these instances?
One tactic that can be used to handle non-linearly separable datasets is the application of a “soft margin” SVM classifier. A soft margin classifier operates by accepting a few misclassified data points. It will try to draw a line that best separates the clusters of data points from each other, as they contain the majority of the instances belonging to their respective classes. The soft margin SVM classifier attempts to create a dividing line that balances the two demands of the classifier: accuracy and margin. It will try to minimize the misclassification while also maximizing the margin.
The SVM’s tolerance for error can be adjusted through manipulation of a hyperparameter called “C”. The C value controls how many support vectors the classifier considers when drawing decision boundaries. The C value is a penalty applied to misclassifications, meaning that the larger the C value the fewer support vectors the classifier takes into account and the narrower the margin.
The Kernel Trick operates by applying nonlinear transformations to the features in the dataset. The Kernel Trick takes the existing features in the dataset and creates new features through the application of nonlinear mathematical functions. What results from the application of these nonlinear transformations is a nonlinear decision boundary. Because the SVM classifier is no longer restricted to drawing linear decision boundaries it can start drawing curved decision boundaries that better encapsulate the true distribution of the support vectors and bring misclassifications to a minimum. Two of the most popular SVM nonlinear kernels are Radial Basis Function and Polynomial. The polynomial function creates polynomial combinations of all the existing features, while the Radial Basis Function generates new features by measuring the distance between a central point/points to all the other points.