What is Dimensionality Reduction?
Dimensionality reduction is a process used to reduce the dimensionality of a dataset, taking many features and representing them as fewer features. For example, dimensionality reduction could be used to reduce a dataset of twenty features down to just a few features. Dimensionality reduction is commonly used in unsupervised learning tasks to automatically create classes out of many features. In order to better understand why and how dimensionality reduction is used, we’ll take a look at the problems associated with high dimensional data and the most popular methods of reducing dimensionality.
More Dimensions Leads to Overfitting
Dimensionality refers to the number of features/columns within a dataset.
It’s often assumed that in machine learning more features are better, as it creates a more accurate model. However, more features don’t necessarily translate to a better model.
The features of a dataset can vary widely in terms of how useful they are to the model, with many features being of little importance. In addition, the more features the dataset contains, the more samples are needed to ensure that the different combinations of features are well represented within the data. Therefore, the number of samples increases in proportion with the number of features. More samples and more features mean that the model needs to be more complex, and as models become more complex they become more sensitive to overfitting. The model learns the patterns in the training data too well and it fails to generalize to out of sample data.
Reducing the dimensionality of a dataset has several benefits. As mentioned, simpler models are less prone to overfitting, as the model has to make fewer assumptions regarding how features are related to one another. In addition, fewer dimensions mean less computing power is required to train the algorithms. Similarly, less storage space is needed for a dataset that has smaller dimensionality. Reducing the dimensionality of a dataset can also let you use algorithms that are unsuited for datasets with many features.
Common Dimensionality Reduction Methods
Dimensionality reduction can be by feature selection or feature engineering. Feature selection is where the engineer identifies the most relevant features of the dataset, while feature engineering is the process of creating new features by combining or transforming other features.
Feature selection and engineering can be done programmatically or manually. When manually selecting and engineering features, visualizing the data to discover correlations between features and classes is typical. Carrying out dimensionality reduction in this way can be quite time intensive and therefore some of the most common ways of reducing dimensionality involve the use of algorithms available in libraries like Scikit-learn for Python. These common dimensionality reduction algorithms include: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA).
The algorithms used in dimensionality reduction for unsupervised learning tasks are typically PCA and SVD, while those leveraged for supervised learning dimensionality reduction are typically LDA and PCA. In the case of supervised learning models, the newly generated features are just fed into the machine learning classifier. Take note that the uses described here are just general use cases and not the only conditions that these techniques may be used in. The dimensionality reduction algorithms described above are simply statistical methods and they are used outside of machine learning models.
Principal Component Analysis
Principal Component Analysis (PCA) is a statistical method that analyzes the characteristics/features of a dataset and summarizes the features that are the most influential. The features of the dataset are combined together into representations that maintain most of the characteristics of the data but are spread across fewer dimensions. You can think of this as “squishing” the data down from a higher dimension representation to one with just a few dimensions.
As an example of a situation where PCA might be useful, think about the various ways one could describe wine. While it is possible to describe wine using many highly specific features like CO2 levels, aeration levels, etc., such specific features may be relatively useless when trying to identify a specific type of wine. Instead, it would be more prudent to identify the type based on more general features like taste, color, and age. PCA can be used to combine more specific features and create features that are more general, useful, and less likely to cause overfitting.
PCA is carried out by determining how the input features vary from the mean with respect to each other, determining if any relationships exist between the features. In order to do this, a covariant matrix is created, establishing a matrix composed of the covariances with respect to the possible pairs of the dataset features. This is used to determine correlations between the variables, with a negative covariance indicating an inverse correlation and a positive correlation indicating a positive correlation.
The principal (most influential) components of the dataset are created by creating linear combinations of the initial variables, which is done with the assistance of linear algebra concepts called eigenvalues and eigenvectors. The combinations are created so that the principal components are uncorrelated with each other. Most of the information contained in the initial variables is compressed into the first few principal components, meaning new features (the principal components) have been created that contain the information from the original dataset in a smaller dimensional space.
Singular Value Decomposition
Singular Value Decomposition (SVD) is used to simplify the values within a matrix, reducing the matrix down to its constituent parts and making calculations with that matrix easier. SVD can be utilized for both real-value and complex matrices, but for the purposes of this explanation, will examine how to decompose a matrix of real values.
Assume that we have a matrix composed of real-value data and our goal is to reduce the number of columns/features within the matrix, similar to the goal of PCA. Like PCA, SVD will compress the dimensionality of the matrix while preserving as much of the matrix’s variability as possible. If we want to operate on matrix A, we can represent matrix A as three other matrices called U, D, & V. Matrix A is comprised of the original x * y elements while matrix U is comprised of elements X * X (it is an orthogonal matrix). Matrix V is a different orthogonal matrix containing y * y elements. Matrix D contains the elements x * y and it’s a diagonal matrix.
In order to decompose the values for matrix A, we need to convert the original singular matrix values to the diagonal values found within a new matrix. When working with orthogonal matrices, their properties do not change if they are multiplied by other numbers. Therefore, we can approximate matrix A by taking advantage of this property. When we multiply the orthogonal matrices together with a transpose of Matrix V, the result is an equivalent matrix to our original A.
When Matrix a is decomposed down into matrices U, D, and V, they contain the data found within Matrix A. However, the leftmost columns of the matrices will hold the majority of the data. We can take just these first few columns and have a representation of Matrix A that has far fewer dimensions and most of the data within A.
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a process that takes data from a multidimensional graph and reprojects it onto a linear graph. You can envision this by thinking of a two-dimensional graph filled with data points belonging to two different classes. Assume that the points are scattered around so that no line can be drawn that will neatly separate the two different classes. In order to handle this situation, the points found in the 2D graph can be reduced down to a 1D graph (a line). This line will have all the data points distributed across it and it can hopefully be divided into two sections that represent the best possible separation of the data.
When carrying out LDA there are two primary goals. The first goal is minimizing the variance for the classes, while the second goal is maximizing the distance between the means of the two classes. These goals are accomplished by creating a new axis that will exist in the 2D graph. The newly created axis acts to separate the two classes based on the goals previously described. After the axis has been created, the points found in the 2D graph are placed along the axis.
There are three steps required to move the original points to a new position along the new axis. In the first step, the distance between the individual classes means (the between-class variance) is used to calculate the separability of the classes. In the second step, the variance within the different classes is calculated, done by determining the distance between the sample and mean for the class in question. In the final step, the lower-dimensional space that maximizes the variance between classes is created.
The LDA technique achieves the best results when the means for the target classes are far apart from each other. LDA can’t effectively separate the classes with a linear axis if the means for the distributions overlap.