### AI 101

# What is Big Data?

“Big Data” is one of the commonly used buzz words of our current era, but what does it really mean?

Here’s a quick, simple definition of big data. Big data is data that is too large and complex to be handled by traditional data processing and storage methods. While that’s a quick definition you can use as a heuristic, it would be helpful to have a deeper, more complete understanding of big data. Let’s take a look at some of the concepts that underlie big data, like storage, structure, and processing.

**How Big Is Big Data?**

It isn’t as simple as saying “any data over the size ‘X ‘is big data”, the environment that the data is being handled in is an extremely important factor in determining what qualifies as big data. The size that data needs to be, in order to be considered big data, is dependant upon the context, or the task the data is being used in. Two datasets of vastly different sizes can be considered “big data” in different contexts.

To be more concrete, if you try to send a 200-megabyte file as an email attachment, you would not be able to do so. In this context, the 200-megabyte file could be considered big data. In contrast, copying a 200-megabyte file to another device within the same LAN may not take any time at all, and in that context, it wouldn’t be regarded as big data.

However, let’s assume that 15 terabytes worth of video need to be pre-processed for use in training computer vision applications. In this case, the video files take up so much space that even a powerful computer would take a long time to process them all, and so the processing would normally be distributed across multiple computers linked together in order to decrease processing time. These 15 terabytes of video data would definitely qualify as big data.

**Types Of Big Data Structures**

Big data comes in three different categories of structure: un-structured data, semi-structured, and structured data.

Unstructured data is data that possesses no definable structure, meaning the data is essentially just in one large pool. Examples of unstructured data would be a database full of unlabeled images.

Semi-structured data is data that doesn’t have a formal structure, but does exist within a loose structure. For example, email data migtht count as semi-structured data, because you could refer to the data contained in individual emails, but formal data patterns have not been established.

Structured data is data that has a formal structure, with data points categorized by different features. One example of structured data is an excel spreadsheet containing contact information like names, emails, phone numbers, and websites.

If you would like to read more about the differences in these data types, check the link here.

**Metrics For Assessing Big Data**

Big data can be analyzed in terms of three different metrics: volume, velocity, and variety.

Volume refers to the size of the data. The average size of datasets is often increasing. For example, the largest hard drive in 2006 was a 750 GB hard drive. In contrast, Facebook is thought to generate over 500 terabytes of data in a day and the largest consumer hard drive available today is a 16 terabyte hard drive. What quantifies as big data in one era may not be big data in another. More data is generated today because more and more of the objects surrounding us are equipped with sensors, cameras, microphones, and other data collection devices.

Velocity refers to how fast data is moving, or to put that another way, how much data is generated within a given period of time. Social media streams generate hundreds of thousands of posts and comments every minute, while your own email inbox will probably have much less activity. Big data streams are streams that often handle hundreds of thousands or millions of events in more or less real-time. Examples of these data streams are online gaming platforms and high-frequency stock trading algorithms.

Variety refers to the different types of data contained within the dataset. Data can be made up of many different formats, like audio, video, text, photos, or serial numbers. In general, traditional databases are formatted to handle one, or just a couple, types of data. To put that another way, traditional databases are structured to hold data that is fairly homogenous and of a consistent, predictable structure. As applications become more diverse, full of different features, and used by more people, databases have had to evolve to store more types of data. Unstructured databases are ideal for holding big data, as they can hold multiple data types that aren’t related to each other.

**Methods Of Handling Big Data**

There are a number of different platforms and tools designed to facilitate the analysis of big data. Big data pools need to be analyzed to extract meaningful patterns from the data, a task that can prove quite challenging with traditional data analysis tools. In response to the need for tools to analyze large volumes of data, a variety of companies have created big data analysis tools. Big data analysis tools include systems like ZOHO Analytics, Cloudera, and Microsoft BI.

### AI 101

# What is Linear Regression?

Linear regression is an algorithm used to predict, or visualize, a relationship between two different features/variables. In linear regression tasks, there are two kinds of variables being examined: the dependent variable and the independent variable. The independent variable is the variable that stands by itself, not impacted by the other variable. As the independent variable is adjusted, the levels of the dependent variable will fluctuate. The dependent variable is the variable that is being studied, and it is what the regression model solves for/attempts to predict. In linear regression tasks, every observation/instance is comprised of both the dependent variable value and the independent variable value.

That was a quick explanation of linear regression, but let’s make sure we come to a better understanding of linear regression by looking at an example of it and examining the formula that it uses.

## Understanding Linear Regression

Assume that we have a dataset covering hard-drive sizes and the cost of those hard drives.

Let’s suppose that the dataset we have is comprised of two different features: the amount of memory and cost. The more memory we purchase for a computer, the more the cost of the purchase goes up. If we plotted out the individual data points on a scatter plot, we might get a graph that looks something like this:

The exact memory-to-cost ratio might vary between manufacturers and models of hard drive, but in general, the trend of the data is one that starts in the bottom left (where hard drives are both cheaper and have smaller capacity) and moves to the upper right (where the drives are more expensive and have higher capacity).

If we had the amount of memory on the X-axis and the cost on the Y-axis, a line capturing the relationship between the X and Y variables would start in the lower-left corner and run to the upper right.

The function of a regression model is to determine a linear function between the X and Y variables that best describes the relationship between the two variables. In linear regression, it’s assumed that Y can be calculated from some combination of the input variables. The relationship between the input variables (X) and the target variables (Y) can be portrayed by drawing a line through the points in the graph. The line represents the function that best describes the relationship between X and Y (for example, for every time X increases by 3, Y increases by 2). The goal is to find an optimal “regression line”, or the line/function that best fits the data.

Lines are typically represented by the equation: Y = m*X + b. X refers to the dependent variable while Y is the independent variable. Meanwhile, m is the slope of the line, as defined by the “rise” over the “run”. Machine learning practitioners represent the famous slope-line equation a little differently, using this equation instead:

y(x) = w0 + w1 * x

In the above equation, y is the target variable while “w” is the model’s parameters and the input is “x”. So the equation is read as: “The function that gives Y, depending on X, is equal to the parameters of the model multiplied by the features”. The parameters of the model are adjusted during training to get the best-fit regression line.

## Multiple Regression

The process described above applies to simple linear regression, or regression on datasets where there is only a single feature/independent variable. However, a regression can also be done with multiple features. In the case of “multiple linear regression”, the equation is extended by the number of variables found within the dataset. In other words, while the equation for regular linear regression is y(x) = w0 + w1 * x, the equation for multiple linear regression would be y(x) = w0 + w1x1 plus the weights and inputs for the various features. If we represent the total number of weights and features as w(n)x(n), then we could represent the formula like this:

y(x) = w0 + w1x1 + w2x2 + … + w(n)x(n)

After establishing the formula for linear regression, the machine learning model will use different values for the weights, drawing different lines of fit. Remember that the goal is to find the line that best fits the data in order to determine which of the possible weight combinations (and therefore which possible line) best fits the data and explains the relationship between the variables.

A cost function is used to measure how close the assumed Y values are to the actual Y values when given a particular weight value. The cost function for linear regression is mean squared error, which just takes the average (squared) error between the predicted value and the true value for all of the various data points in the dataset. The cost function is used to calculate a cost, which captures the difference between the predicted target value and the true target value. If the fit line is far from the data points, the cost will be higher, while the cost will become smaller the closer the line gets to capturing the true relationships between variables. The weights of the model are then adjusted until the weight configuration that produces the smallest amount of error is found.

### AI 101

# What are Support Vector Machines?

Support vector machines are a type of machine learning classifier, arguably one of the most popular kinds of classifiers. Support vector machines are especially useful for numerical prediction, classification, and pattern recognition tasks.

Support vector machines operate by drawing decision boundaries between data points, aiming for the decision boundary that best separates the data points into classes (or is the most generalizable). The goal when using a support vector machine is that the decision boundary between the points is as large as possible so that the distance between any given data point and the boundary line is maximized. That’s a quick explanation of how support vector machines (SVMs) operate, but let’s take some time to delve deeper into how SVMs operate and understand the logic behind their operation.

## The Goal Of Support Vector Machines

Imagine a graph with a number of data points on it, based on features specified by the X and Y axes. The data points on the graph can loosely be divided up into two different clusters, and the cluster that a data point belongs to indicates the class of the data point. Now assume that we want to draw a line down the graph that separates the two classes from each other, with all the data points in one class found on one side of the line and all the data points belonging to another class found on the other side of the line. This separating line is known as a hyperplane.

You can think of a support vector machine as creating “roads” throughout a city, separating the city into districts on either side of the road. All the buildings (data points) that are found on one side of the road belong to one district.

The goal of a support vector machine is not only to draw hyperplanes and divide data points, but to draw the hyperplane the separates data points with the largest margin, or with the most space between the dividing line and any given data point. Returning to the “roads” metaphor, if a city planner draws plans for a freeway, they don’t want the freeway to be too close to houses or other buildings. The more margin between the freeway and the buildings on either side, the better. The larger this margin, the more “confident” the classifier can be about its predictions. In the case of binary classification, drawing the correct hyperplane means choosing a hyperplane that is just in the middle of the two different classes. If the decision boundary/hyperplane is farther from one class, it will be closer to another. Therefore, the hyperplane must balance the margin between the two different classes.

## Calculating The Separating Hyperplane From Support Vectors

So how does a support vector machine determine the best separating hyperplane/decision boundary? This is accomplished by calculating possible hyperplanes using a mathematical formula. We won’t cover the formula for calculating hyperplanes in extreme detail, but the line is calculated with the famous slope/line formula:

Y = ax + b

Meanwhile, lines are made out of points, which means any hyperplane can be described as: the set of points that run parallel to the proposed hyperplane, as determined by the weights of the model times the set of features modified by a specified offset/bias (“d”).

SVMs draw many hyperplanes. For example, the boundary line is one hyperplane, but the datapoints that the classifier considers are also on hyperplanes. The values for x are determined based on the features in the dataset. For instance, if you had a dataset with the heights and weights of many people, the “height” and “weight” features would be the features used to calculate the “X”. The margins between the proposed hyperplane and the various “support vectors” (datapoints) found on either side of the dividing hyperplane are calculated with the following formula:

W * X – b

While you can read more about the math behind SVMs, if you are looking for a more intuitive understanding of them just know that the goal is to maximize the distance between the proposed separating hyperplane/boundary line and the other hyperplanes that run parallel to it (and on which the data points are found).

## Multiclass Classification

The process described so far applies to binary classification tasks. However, SVM classifiers can also be used for non-binary classification tasks. When doing SVM classification on a dataset with three or more classes, more boundary lines are used. For example, if a classification task has three classes instead of two, two dividing lines will be used to divide up data points into classes and the region that comprises a single class will fall in between two dividing lines instead of one. Instead of just calculating the distance between just two classes and a decision boundary, the classifier must consider now the margins between the decision boundaries and the multiple classes within the dataset.

## Non-Linear Separations

The process described above applies to cases where the data is linearly separable. Note that, in reality, datasets are almost never completely linearly separable, which means that when using an SVM classifier you will often need to use two different techniques: soft margin and kernel tricks. Consider a situation where data points of different classes are mixed together, with some instances belonging to one class in the “cluster” of another class. How could you have the classifier handle these instances?

One tactic that can be used to handle non-linearly separable datasets is the application of a “soft margin” SVM classifier. A soft margin classifier operates by accepting a few misclassified data points. It will try to draw a line that best separates the clusters of data points from each other, as they contain the majority of the instances belonging to their respective classes. The soft margin SVM classifier attempts to create a dividing line that balances the two demands of the classifier: accuracy and margin. It will try to minimize the misclassification while also maximizing the margin.

The SVM’s tolerance for error can be adjusted through manipulation of a hyperparameter called “C”. The C value controls how many support vectors the classifier considers when drawing decision boundaries. The C value is a penalty applied to misclassifications, meaning that the larger the C value the fewer support vectors the classifier takes into account and the narrower the margin.

The Kernel Trick operates by applying nonlinear transformations to the features in the dataset. The Kernel Trick takes the existing features in the dataset and creates new features through the application of nonlinear mathematical functions. What results from the application of these nonlinear transformations is a nonlinear decision boundary. Because the SVM classifier is no longer restricted to drawing linear decision boundaries it can start drawing curved decision boundaries that better encapsulate the true distribution of the support vectors and bring misclassifications to a minimum. Two of the most popular SVM nonlinear kernels are Radial Basis Function and Polynomial. The polynomial function creates polynomial combinations of all the existing features, while the Radial Basis Function generates new features by measuring the distance between a central point/points to all the other points.

### AI 101

# What Is Overfitting?

When you train a neural network, you have to avoid overfitting. Overfitting is an issue within machine learning and statistics where a model learns the patterns of a training dataset too well, perfectly explaining the training data set but failing to generalize its predictive power to other sets of data. To put that another way, in the case of an overfitting model it will often show extremely high accuracy on the training dataset but low accuracy on data collected and run through the model in the future. That’s a quick definition of overfitting, but let’s go over the concept of overfitting in more detail. Let’s take a look at how overfitting occurs and how it can be avoided.

## Understanding “Fit” and Overfitting

Before we delve too deeply into overfitting, it might be helpful to take a look at the concept of underfitting and “fit” generally. When we train a model we are trying to develop a framework that is capable of predicting the nature, or class, of items within a dataset, based on the features that describe those items. A model should be able to explain a pattern within a dataset and predict the classes of future data points based off of this pattern. The better the model explains the relationship between the features of the training set, the more “fit” our model is.

A model that poorly explains the relationship between the features of the training data and thus fails to accurately classify future data examples is underfitting the training data. If you were to graph the predicted relationship of an underfitting model against the actual intersection of the features and labels, the predictions would veer off the mark. If we had a graph with the actual values of a training set labeled, a severely underfitting model would drastically miss most of the data points. A model with a better fit might cut a path through the center of the data points, with individual data points being off of the predicted values by only a little.

Underfitting can often occur when there is insufficient data to create an accurate model, or when trying to design a linear model with non-linear data. More training data or more features will often help reduce underfitting.

So why wouldn’t we just create a model that explains every point in the training data perfectly? Surely perfect accuracy is desirable? Creating a model that has learned the patterns of the training data too well is what causes overfitting. The training data set and other, future datasets you run through the model will not be exactly the same. They will likely be very similar in many respects, but they will also differ in key ways. Therefore, designing a model that explains the training dataset perfectly means you end up with a theory about the relationship between features that doesn’t generalize well to other datasets.

## Understanding Overfitting

Overfitting occurs when a model learns the details within the training dataset too well, causing the model to suffer when predictions are made on outside data. This may occur when the model not only learns the features of the dataset, it also learns random fluctuations or noise within the dataset, placing importance on these random/unimportant occurrences.

Overfitting is more likely to occur when nonlinear models are used, as they are more flexible when learning data features. Nonparametric machine learning algorithms often have various parameters and techniques that can be applied to constrain the model’s sensitivity to data and thereby reduce overfitting. As an example, decision tree models are highly sensitive to overfitting, but a technique called pruning can be used to randomly remove some of the detail that the model has learned.

If you were to graph out the predictions of the model on X and Y axes, you would have a line of prediction that zigzags back and forth, which reflects the fact that the model has tried too hard to fit all the points in the dataset into its explanation.

## Controlling Overfitting And Getting A Good Fit

When we train a model, we ideally want the model to make no errors. When the model’s performance converges towards making correct predictions on all the data points in the training dataset, the fit is becoming better. A model with a good fit is able to explain almost all of the training dataset without overfitting.

As a model trains its performance improves over time. The model’s error rate will decrease as training time passes, but it only decreases to a certain point. The point at which the model’s performance on the test set begins to rise again is typically the point at which overfitting is occurring. In order to get the best fit for a model, we want to stop training the model at the point of lowest loss on the training set, before error starts increasing again. The optimal stopping point can be ascertained by graphing the performance of the model throughout the training time and stopping training when loss is lowest. However, one risk with this method of controlling for overfitting is that specifying the endpoint for the training based on test performance means that the test data becomes somewhat included in the training procedure, and it loses its status as purely “untouched” data.

There are a couple of different ways that one can combat overfitting. One method of reducing overfitting is to use a resampling tactic, which operates by estimating the accuracy of the model. You can also use a validation dataset in addition to the test set and plot the training accuracy against the validation set instead of the test dataset. This keeps your test dataset unseen. A popular resampling method is K-folds cross-validation. This technique enables you to divide your data into subsets that the model is trained on, and then the performance of the model on the subsets is analyzed to estimate how the model will perform on outside data.

Making use of cross-validation is one of the best ways to estimate a model’s accuracy on unseen data, and when combined with a validation dataset overfitting can often be kept to a minimum.