AI 101

What is Machine Learning?

Machine learning is one of the fastest-growing technological fields, but despite how often the words “machine learning” are tossed around, it can be difficult to understand precisely what machine learning is.

Machine learning doesn’t refer to just one thing. It’s an umbrella term that can be applied to many different concepts and techniques. Understanding machine learning means being familiar with different forms of model analysis, variables, and algorithms. Let’s take a close look at machine learning to better understand what it encompasses.

What Is Machine Learning?

While the term machine learning can be applied to many different things, in general, the term refers to enabling a computer to carry out tasks without receiving explicit line-by-line instructions to do so. A machine learning specialist doesn’t have to write out all the steps necessary to solve the problem because the computer is capable of “learning” by analyzing patterns within the data and generalizing these patterns to new data.

Machine learning systems have three basic parts:

  • Inputs
  • Algorithms
  • Outputs

The inputs are the data that is fed into the machine learning system, and the input data can be divided into labels and features. Features are the relevant variables, the variables that will be analyzed to learn patterns and draw conclusions. Meanwhile, the labels are classes/descriptions given to the individual instances of the data.

Features and labels can be used in two different types of machine learning problems: supervised learning and unsupervised learning.

Unsupervised vs. Supervised Learning

In supervised learning, the input data is accompanied by a ground truth. Supervised learning problems have the correct output values as part of the dataset, so the expected classes are known in advance. This makes it possible for the data scientist to check the performance of the algorithm by running the trained model on a test dataset and seeing what percentage of items were correctly classified.
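As a rough illustration of that check, the sketch below computes the percentage of correctly classified items. The labels and predictions are made up for the example, and `accuracy` is a hypothetical helper name rather than part of any particular library.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Hypothetical test-set labels and model predictions, made up for illustration.
true_labels = ["cat", "dog", "dog", "cat", "dog"]
predicted_labels = ["cat", "dog", "cat", "cat", "dog"]

print(accuracy(predicted_labels, true_labels))  # 4 of 5 correct -> 0.8
```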

In contrast, unsupervised learning problems do not have ground truth labels attached to them. A machine learning algorithm trained to carry out unsupervised learning tasks must be able to infer the relevant patterns in the data for itself.

Supervised learning algorithms are typically used for classification problems, where one has a large dataset filled with instances that must be sorted into one of many different classes. Another type of supervised learning is a regression task, where the value output by the algorithm is continuous in nature instead of categorical.

Meanwhile, unsupervised learning algorithms are used for tasks like density estimation, clustering, and representation learning. These three tasks need the machine learning model to infer the structure of the data, because there are no predefined classes given to the model.

Let’s take a brief look at some of the most common algorithms used in both unsupervised learning and supervised learning.

Supervised Learning

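Common supervised learning algorithms include:

  • Support Vector Machines
  • Logistic Regression
  • Decision Trees
  • Random Forests
  • Naive Bayes
  • Artificial Neural Networks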

Support Vector Machines are algorithms that divide up a dataset into different classes. Data points are separated into classes by drawing lines (or, with more than two features, hyperplanes) between them. Points found on one side of the line belong to one class, while points on the other side belong to a different class. Support Vector Machines aim to maximize the distance between the line and the nearest points on either side of it, and the greater a point’s distance from the line, the more confident the classifier is that the point belongs to one class and not the other.

Logistic Regression is an algorithm used in binary classification tasks, where data points need to be classified as belonging to one of two classes. Logistic Regression works by estimating the probability, a value between 0 and 1, that a data point belongs to class 1. With the usual threshold of 0.5, a data point whose predicted probability is below 0.5 is classified as 0, while one whose predicted probability is 0.5 or above is classified as 1.
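The thresholding step can be sketched in a few lines. The weights and bias below are hand-picked, hypothetical values (a real model would learn them from training data), and `predict_class` is an illustrative name.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one data point."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, 1 if probability >= threshold else 0

# Hypothetical, hand-picked parameters; a trained model would learn these.
weights, bias = [1.5, -2.0], 0.25

print(predict_class([2.0, 0.5], weights, bias))  # probability above 0.5 -> class 1
print(predict_class([0.0, 1.0], weights, bias))  # probability below 0.5 -> class 0
```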

Decision Tree algorithms operate by dividing datasets up into smaller and smaller fragments. The exact criteria used to divide the data are up to the machine learning engineer, but the goal is to keep splitting until each fragment contains (mostly) data points of a single class. A new data point is then classified by following the splits down to a fragment and taking that fragment’s class.
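To make the idea of a splitting criterion concrete, here is a minimal sketch of how a single split might be chosen. It assumes Gini impurity as the criterion, which is one common choice and not one the text above prescribes. The plant data is made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted impurity."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up data: plant heights in meters and their classes.
heights = [1.0, 1.2, 1.4, 3.0, 3.3, 3.5]
labels = ["shrub", "shrub", "shrub", "tree", "tree", "tree"]

print(best_threshold(heights, labels))  # (1.4, 0.0): splitting at 1.4 gives two pure groups
```

A full decision tree would apply the same search recursively to each resulting fragment.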

A Random Forest algorithm is essentially many single Decision Tree classifiers linked together into a more powerful classifier.

The Naive Bayes Classifier calculates the probability that a given data point belongs to each class, based on the prior probability of the class and the probability of the data point’s features occurring within that class. It is based on Bayes Theorem, and it places the data points into classes based on their calculated probability. When implementing a Naive Bayes classifier, it is assumed that all the predictors are independent of one another, so each one contributes to the class outcome separately.

Artificial Neural Networks, or multi-layer perceptrons, are machine learning algorithms inspired by the structure and function of the human brain. Artificial neural networks get their name from the fact that they are made out of many nodes/neurons linked together. Every neuron manipulates the data with a mathematical function. In artificial neural networks, there are input layers, hidden layers, and output layers.

The hidden layer of the neural network is where the data is actually interpreted and analyzed for patterns. In other words, it is where the algorithm learns. More neurons joined together make more complex networks capable of learning more complex patterns.
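The sketch below shows how data might flow from an input layer through one hidden layer to an output layer. The weights are fixed, hypothetical values chosen for illustration. A real network would learn them during training, and that part is not shown here.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: each neuron applies a function to a weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical fixed parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
HIDDEN_WEIGHTS, HIDDEN_BIASES = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
OUTPUT_WEIGHTS, OUTPUT_BIASES = [[1.0, -1.0]], [0.0]

def forward(inputs):
    hidden = dense_layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    output = dense_layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, sigmoid)
    return output[0]

print(forward([1.0, 1.0]))
```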

Unsupervised Learning

Unsupervised Learning algorithms include:

  • K-means clustering
  • Autoencoders
  • Principal Component Analysis

K-means clustering is an unsupervised learning technique, and it works by separating points of data into clusters or groups based on their features. K-means clustering analyzes the features found in the data points and distinguishes patterns in them that make the data points in a given cluster more similar to each other than they are to the data points in other clusters. This is accomplished by placing possible centers for the clusters, or centroids, in a graph of the data and repeatedly moving each centroid until a position is found that minimizes the distance between the centroid and the points that belong to that centroid’s cluster. The researcher can specify the desired number of clusters.
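The centroid-moving loop can be sketched as follows. This is a minimal version under a few assumptions: the initial centroids are simply the first k points (random initialization is more common in practice), the number of iterations is fixed, and the points are made up. It uses `math.dist`, which needs Python 3.8 or later.

```python
import math

def k_means(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]  # assumption: start from the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Made-up 2D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, assignments = k_means(points, k=2)

print(assignments)  # [0, 1, 0, 1, 0, 1]
print(centroids)
```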

Principal Component Analysis is a technique that reduces large numbers of features/variables down into a smaller feature space/fewer features. The “principal components” of the data, the directions along which the data varies the most, are selected for preservation, while the remaining variation is discarded or squeezed down into a smaller representation. The relationships between the original data points are largely preserved, but since the data points are now less complex, the data is easier to quantify and describe.
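For intuition, here is a small sketch that reduces two features to one. It relies on a closed-form angle that only works for two-dimensional data, so treat it as an illustration of the idea rather than a general implementation. Real PCA code usually computes an eigendecomposition or SVD with a numerical library. The points are made up.

```python
import math

def first_principal_component(points):
    """Return (direction, projected values) for 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    # Closed-form angle of the direction with the largest variance (2D only).
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))
    # Project every centered point onto that direction: two features become one.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Made-up points lying along the line y = x.
direction, projected = first_principal_component(
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
)
print(direction)
print(projected)
```

Because these example points lie exactly on a line, the single projected value per point keeps all of the variation in the original two features.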

Autoencoders are versions of neural networks that can be applied to unsupervised learning tasks. Autoencoders are capable of taking unlabeled, free-form data and using the input itself as the training target, basically creating their own labeled training data. The goal of an autoencoder is to compress the input data and then rebuild it as accurately as possible, so the network has an incentive to determine which features are the most important and extract them.

To Learn More

Recommended Machine Learning Courses        Offered By                Duration  Difficulty
Introduction to Artificial Intelligence     IBM                       9 Hours   Beginner
Deep Learning for Business                  Yonsei University         8 Hours   Beginner
An Introduction to Practical Deep Learning  Intel Software            12 Hours  Intermediate
Machine Learning Foundations                University of Washington  24 Hours  Intermediate

Blogger and programmer with specialties in Machine Learning and Deep Learning topics. Daniel hopes to help others use the power of AI for social good.

What is Bayes Theorem?

If you’ve been learning about data science or machine learning, there’s a good chance you’ve heard the term “Bayes Theorem” before, or a “Bayes classifier”. These concepts can be somewhat confusing, especially if you are used to thinking of probability only from a traditional, frequentist statistics perspective. This article will attempt to explain the principles behind Bayes Theorem and how it’s used in machine learning.

Defining Bayes Theorem

Bayes Theorem is a method of calculating conditional probability. The traditional method of calculating conditional probability (the probability that one event occurs given the occurrence of a different event) is to use the conditional probability formula, calculating the joint probability of event one and event two occurring at the same time, and then dividing it by the probability of event two occurring. However, conditional probability can also be calculated in a slightly different fashion by using Bayes Theorem.

When calculating conditional probability with Bayes theorem, you use the following steps:

  • Determine the probability of condition B being true, assuming that condition A is true.
  • Determine the probability of event A being true.
  • Multiply the two probabilities together.
  • Divide by the probability of event B occurring.

This means that the formula for Bayes Theorem could be expressed like this:

P(A|B) = P(B|A)*P(A) / P(B)

Calculating the conditional probability like this is especially useful when the reverse conditional probability can be easily calculated, or when calculating the joint probability would be too challenging.
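As a quick worked example, the formula can be written as a one-line function. The rain and cloud probabilities below are made-up numbers chosen only to show the arithmetic.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) using P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b

# Made-up numbers: A = "it rains today", B = "the morning is cloudy".
p_rain = 0.2               # P(A)
p_cloudy_given_rain = 0.9  # P(B|A)
p_cloudy = 0.4             # P(B)

print(bayes(p_cloudy_given_rain, p_rain, p_cloudy))  # roughly 0.45
```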

A Practical Example

This might be easier to interpret if we spend some time looking at an example of how you would apply Bayesian reasoning and Bayes Theorem. Let’s assume you were playing a simple game where multiple participants tell you a story and you have to determine which one of the participants is lying to you. Let’s fill in the equation for Bayes Theorem with the variables in this hypothetical scenario.

We’re trying to predict whether each individual in the game is lying or telling the truth, so if there are three players apart from you, the categorical variables can be expressed as A1, A2, and A3. The evidence for their lies/truth is their behavior. Like when playing poker, you would look for certain “tells” that a person is lying and use those as bits of information to inform your guess. Or if you were allowed to question them it would be any evidence their story doesn’t add up. We can represent the evidence that a person is lying as B.

To be clear, we’re aiming to predict P(A | B): the probability that a person is lying or telling the truth, given the evidence of their behavior. To do this we’d want to figure out the probability of B given A, or the probability that their behavior would occur given the person genuinely lying or telling the truth. You’re trying to determine under which conditions the behavior you are seeing would make the most sense. If there are three behaviors you are witnessing, you would include each behavior in the calculation, for example P(B1, B2, B3 | A). You would then do this for every occurrence of A, that is, for every person in the game aside from yourself. That’s this part of the equation above:

P(B1, B2, B3 | A) * P(A)

Finally, we just divide that by the probability of B.

If we received any evidence about the actual probabilities in this equation, we would recreate our probability model, taking the new evidence into account. This is called updating your priors, as you update your assumptions about the prior probability of the observed events occurring.
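Here is a small sketch of that updating process for the lying game. All of the numbers are invented for illustration: it assumes exactly one of the three players is lying (so the prior is 1/3), and it treats the two “tells” as independent pieces of evidence so that they can be applied one after the other.

```python
def update(prior, p_evidence_if_lying, p_evidence_if_truthful):
    """Return P(lying | evidence) from a prior P(lying) and the two likelihoods."""
    numerator = p_evidence_if_lying * prior
    # P(evidence) via the law of total probability over lying / truthful.
    p_evidence = numerator + p_evidence_if_truthful * (1 - prior)
    return numerator / p_evidence

belief = 1 / 3  # assumed prior: one liar among three players

# Tell 1 (hypothetical): the player avoids eye contact.
belief = update(belief, p_evidence_if_lying=0.6, p_evidence_if_truthful=0.2)
print(round(belief, 3))  # 0.6

# Tell 2 (hypothetical): the player's story changes when questioned.
belief = update(belief, p_evidence_if_lying=0.5, p_evidence_if_truthful=0.25)
print(round(belief, 3))  # 0.75
```

Each posterior becomes the prior for the next piece of evidence, which is what “updating your priors” refers to.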

Machine Learning Applications

The most common use of Bayes theorem when it comes to machine learning is in the form of the Naive Bayes algorithm.

Naive Bayes is used for the classification of both binary and multi-class datasets. Naive Bayes gets its name because the values assigned to the evidence/attributes (the Bs in P(B1, B2, B3 | A)) are assumed to be independent of one another. It’s assumed that these attributes don’t impact each other in order to simplify the model and make calculations possible, instead of attempting the complex task of calculating the relationships between each of the attributes. Despite this simplified model, Naive Bayes tends to perform quite well as a classification algorithm, even when this assumption probably isn’t true (which is most of the time).
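The independence assumption is what lets the classifier simply multiply one probability per attribute. The sketch below shows that calculation for a toy spam filter. The priors and per-word probabilities are hypothetical values that a real classifier would estimate from training data.

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """priors: {class: P(class)}; likelihoods: {class: {feature: P(feature | class)}}."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            # Independence assumption: multiply the per-feature likelihoods.
            score *= likelihoods[cls][feature]
        scores[cls] = score
    total = sum(scores.values())  # plays the role of P(B) in Bayes Theorem
    return {cls: score / total for cls, score in scores.items()}

# Hypothetical probabilities for a toy spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

print(naive_bayes_posteriors(priors, likelihoods, ["free"]))
print(naive_bayes_posteriors(priors, likelihoods, ["free", "meeting"]))
```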

There are also commonly used variants of the Naive Bayes classifier such as Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes.

Multinomial Naive Bayes algorithms are often used to classify documents, as they are effective at interpreting the frequency of words within a document.

Bernoulli Naive Bayes operates similarly to Multinomial Naive Bayes, but the features used by the algorithm are booleans. This means that each feature takes a binary value, no or yes. In the domain of text classification, a Bernoulli Naive Bayes algorithm would assign each feature a yes or no based on whether or not a word is found within the text document.

If the values of the predictors/features aren’t discrete but are instead continuous, Gaussian Naive Bayes can be used. It’s assumed that the values of the continuous features have been sampled from a Gaussian distribution.
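For a single continuous feature, the Gaussian variant might look like the sketch below. The measurements and class names are made up, the helper names are hypothetical, and equal priors are assumed.

```python
import math
import statistics

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

def fit(values_by_class):
    """Estimate (mean, standard deviation) of one feature for every class."""
    return {
        cls: (statistics.mean(values), statistics.stdev(values))
        for cls, values in values_by_class.items()
    }

def most_likely_class(x, params, priors):
    """Pick the class with the largest prior * Gaussian likelihood."""
    scores = {
        cls: priors[cls] * gaussian_pdf(x, mean, std)
        for cls, (mean, std) in params.items()
    }
    return max(scores, key=scores.get)

# Made-up petal lengths (in cm) for two hypothetical classes.
petal_lengths = {"small": [1.3, 1.5, 1.4, 1.6], "large": [4.5, 4.7, 5.0, 4.4]}
params = fit(petal_lengths)
priors = {"small": 0.5, "large": 0.5}  # assumed equal priors

print(most_likely_class(1.8, params, priors))  # small
print(most_likely_class(4.0, params, priors))  # large
```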

What are RNNs and LSTMs in Deep Learning?

Many of the most impressive advances in natural language processing and AI chatbots are driven by Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNNs and LSTMs are special neural network architectures that are able to process sequential data, that is, data where chronological ordering matters. LSTMs are essentially improved versions of RNNs, capable of interpreting longer sequences of data. Let’s take a look at how RNNs and LSTMs are structured and how they enable the creation of sophisticated natural language processing systems.

Feed-Forward Neural Networks

So before we talk about how Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs) work, we should discuss the format of a neural network in general.

A neural network is intended to examine data and learn relevant patterns, so that these patterns can be applied to other data and new data can be classified. Neural networks are divided into three sections: an input layer, a hidden layer (or multiple hidden layers), and an output layer.

The input layer takes the data into the neural network, while the hidden layers learn the patterns in the data. The hidden layers in the network are connected to the input and output layers by “weights” and “biases”, which are parameters that encode the network’s current assumptions about how the data points are related to each other. These weights are adjusted during training. As the network trains, the model’s guesses about the training data (the output values) are compared against the actual training labels. During the course of training, the network should (hopefully) get more accurate at predicting relationships between data points, so it can accurately classify new data points. Deep neural networks are networks that have more layers in the middle/more hidden layers. The more hidden layers and neurons/nodes the model has, the more complex the patterns in the data that the model can recognize.

Regular, feed-forward neural networks, like the ones I’ve described above, are often called “dense neural networks”. These dense neural networks are combined with different network architectures that specialize in interpreting different kinds of data.

Recurrent Neural Networks

Photo: fdeloche via Wikimedia Commons, CC BY-SA 4.0 (https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg)

Recurrent Neural Networks take the general principle of feed-forward neural networks and enable them to handle sequential data by giving the model an internal memory. The “Recurrent” portion of the RNN name comes from the fact that the input and outputs loop. Once the output of the network is produced, the output is copied and returned to the network as input. When making a decision, not only the current input and output are analyzed, but the previous input is also considered. To put that another way, if the initial input for the network is X and the output is H, both H and X1 (the next input in the data sequence) are fed into the network for the next round of learning. In this way, the context of the data (the previous inputs) is preserved as the network trains.
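The loop described above can be sketched with single numbers in place of vectors. The weights are fixed, hypothetical values, and tanh is assumed as the activation function. A real RNN would use weight matrices and learn them during training.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: the new output H mixes the current input with the previous output."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the recurrent step, carrying H forward each time."""
    h = 0.0  # no context before the first input
    outputs = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        outputs.append(h)
    return outputs

# Only the first input is non-zero, yet it still influences the later outputs.
print(run_rnn([1.0, 0.0, 0.0]))
```

The later outputs shrink toward zero as the first input fades from memory, which hints at why plain RNNs struggle with long sequences.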

The result of this architecture is that RNNs are capable of handling sequential data. However, RNNs suffer from the vanishing gradient and exploding gradient problems. As a result, the length of the sequences that an RNN can interpret is rather limited, especially in comparison to LSTMs.

Long Short-Term Memory Networks

Long Short-Term Memory networks can be considered extensions of RNNs, once more applying the concept of preserving the context of inputs. However, LSTMs have been modified in several important ways that allow them to interpret past data with superior methods. The alterations made to LSTMs deal with the vanishing gradient problem and enable LSTMs to consider much longer input sequences.

Photo: BiObserve (https://commons.wikimedia.org/wiki/User:BiObserve; raster version previously uploaded to Wikimedia), Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton (original), and Eddie Antonio Santos (SVG version with TeX math), CC BY-SA 4.0 (https://commons.wikimedia.org/w/index.php?curid=59931189). Source: https://commons.wikimedia.org/wiki/File:Long_Short_Term_Memory.png, after Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks”, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649, IEEE, 2013.

LSTM models are made up of three different components, or gates. There’s an input gate, an output gate, and a forget gate. Much like RNNs, LSTMs take inputs from the previous timestep into account when modifying the model’s memory and input weights. The input gate makes decisions about which values are important and should be let through the model. A sigmoid function is used in the input gate, which makes determinations about which values to pass on through the recurrent network. A value of 0 drops the input, while a value of 1 preserves it. A tanh function is used here as well, which decides how important to the model the input values are, ranging from -1 to 1.

After the current inputs and memory state are accounted for, the output gate decides which values to push to the next time step. In the standard formulation, a sigmoid function in the output gate produces values between 0 and 1 that decide how much of the memory state to pass on, while a tanh function squashes the memory state itself into the range -1 to 1. This regulates the data before it is carried on to the next time-step calculation. Finally, the job of the forget gate is to drop information that the model deems unnecessary to make a decision about the nature of the input values. The forget gate uses a sigmoid function on the values, outputting numbers between 0 (forget this) and 1 (keep this).
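To see how the gates interact, here is a scalar sketch of a single LSTM step. It follows the standard textbook formulation (sigmoid gates, a tanh candidate value, and a tanh applied to the memory state before output). The parameter values are hypothetical, and real LSTMs operate on vectors with learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps each gate name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))       # 0 = forget, 1 = keep
    input_gate = sigmoid(pre_activation("input"))    # 0 = drop, 1 = let through
    candidate = math.tanh(pre_activation("candidate"))  # new value, from -1 to 1
    output_gate = sigmoid(pre_activation("output"))

    c = forget * c_prev + input_gate * candidate  # updated memory state
    h = output_gate * math.tanh(c)                # value pushed to the next time step
    return h, c

# Hypothetical parameters; the forget-gate bias of 2.0 favors keeping memory.
params = {
    "forget": (0.0, 0.0, 2.0),
    "input": (1.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 3), round(c, 3))
```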

An LSTM neural network is made out of both special LSTM layers that can interpret sequential word data and densely connected layers like those described above. Once the data moves through the LSTM layers, it proceeds into the densely connected layers.

What is K-Nearest Neighbors?

K-Nearest Neighbors is a machine learning technique and algorithm that can be used for both regression and classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a target data point, in order to make a prediction about the class that the data point falls into. K-Nearest Neighbors (KNN) is a conceptually simple yet very powerful algorithm, and for those reasons, it’s one of the most popular machine learning algorithms. Let’s take a deep dive into the KNN algorithm and see exactly how it works. Having a good understanding of how KNN operates will let you appreciate the best and worst use cases for KNN.

An Overview Of KNN

Photo: Antti Ajanki AnAj via Wikimedia Commons, CC BY-SA 3.0 (https://commons.wikimedia.org/wiki/File:KnnClassification.svg)

Let’s visualize a dataset on a 2D plane. Picture a bunch of data points on a graph, spread out along the graph in small clusters. KNN examines the distribution of the data points and, depending on the arguments given to the model, it separates the data points into groups. These groups are then assigned a label. The primary assumption that a KNN model makes is that data points/instances which exist in close proximity to each other are highly similar, while if a data point is far away from another group it’s dissimilar to those data points.

A KNN model calculates similarity using the distance between two points on a graph. The greater the distance between the points, the less similar they are. There are multiple ways of calculating the distance between points, but the most common distance metric is just Euclidean distance (the distance between two points in a straight line).

KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to them/their classes must be known. There are two other important things to know about KNN. First, KNN is a non-parametric algorithm. This means that no assumptions about the dataset are made when the model is used. Rather, the model is constructed entirely from the provided data. Second, KNN has no separate training phase. Rather than learning a generalized model from a training set, KNN simply stores the training data and uses all of it whenever the model is asked to make predictions. (You can still hold out a test set to measure how well the model performs.)

How The KNN Algorithm Operates

A KNN algorithm goes through five main steps as it is carried out:

  1. Setting K to the chosen number of neighbors.
  2. Calculating the distance between a provided/test example and the dataset examples.
  3. Sorting the calculated distances.
  4. Getting the labels of the top K entries.
  5. Returning a prediction about the test example.

In the first step, K is chosen by the user and it tells the algorithm how many neighbors (how many surrounding data points) should be considered when rendering a judgment about the group the target example belongs to. In the second step, note that the model checks the distance between the target example and every example in the dataset. The distances are then added into a list and sorted. Afterward, the sorted list is checked and the labels for the top K elements are returned. In other words, if K is set to 5, the model checks the labels of the top 5 closest data points to the target data point. When rendering a prediction about the target data point, it matters if the task is a regression or classification task. For a regression task, the mean of the top K labels is used, while the mode of the top K labels is used in the case of classification.
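The steps above can be sketched directly in code. This minimal version assumes Euclidean distance (via `math.dist`, available in Python 3.8 and later) and uses made-up data. The function name and the `task` parameter are illustrative choices, not a standard API.

```python
import math
from collections import Counter

def knn_predict(train, query, k, task="classification"):
    """train: list of (features, label) pairs; query: a tuple of features."""
    # Step 2: distance from the query to every stored example.
    distances = [(math.dist(features, query), label) for features, label in train]
    # Step 3: sort by distance, closest first.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: labels of the K closest examples.
    top_labels = [label for _, label in distances[:k]]
    # Step 5: mean of the labels for regression, mode for classification.
    if task == "regression":
        return sum(top_labels) / len(top_labels)
    return Counter(top_labels).most_common(1)[0][0]

# Made-up classification data: two small groups of points.
colors = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]
print(knn_predict(colors, (2, 2), k=3))  # red

# Made-up regression data: one feature, numeric labels.
prices = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
print(knn_predict(prices, (2,), k=3, task="regression"))  # 20.0
```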

The exact mathematical operations used to carry out KNN differ depending on the chosen distance metric. If you would like to learn more about how the metrics are calculated, you can read about some of the most common distance metrics, such as Euclidean, Manhattan, and Minkowski.
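For reference, the three metrics named above are closely related: Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2 respectively. A short sketch:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Sum of absolute differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)

print(manhattan((0, 0), (3, 4)))     # 7.0
print(euclidean((0, 0), (3, 4)))     # 5.0
print(minkowski((0, 0), (3, 4), 3))  # between 4 and 5
```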

Why The Value Of K Matters

The main limitation when using KNN is that an improper value of K (the wrong number of neighbors to be considered) might be chosen. If this happens, the predictions that are returned can be off substantially. It’s very important that, when using a KNN algorithm, the proper value for K is chosen. You want to choose a value for K that maximizes the model’s ability to make predictions on unseen data while reducing the number of errors it makes.

Photo: Agor153 via Wikimedia Commons, CC BY-SA 3.0 (https://en.wikipedia.org/wiki/File:Map1NN.png)

Lower values of K mean that the predictions rendered by the KNN are less stable and reliable. To get an intuition of why this is so, consider a case where we have 7 neighbors around a target data point. Let’s assume that the KNN model is working with a K value of 2 (we’re asking it to look at the two closest neighbors to make a prediction). If the vast majority of the neighbors (five out of seven) belong to the Blue class, but the two closest neighbors just happen to be Red, the model will predict that the query example is Red. Despite the model’s guess, in such a scenario Blue would be a better guess.

If this is the case, why not just choose the highest K value we can? This is because telling the model to consider too many neighbors will also reduce accuracy. As the radius that the KNN model considers increases, it will eventually start considering data points that are closer to other groups than they are to the target data point, and misclassification will start occurring. For example, even if the point that was initially chosen was in one of the red regions above, if K was set too high, the model would reach into the other regions to consider points. When using a KNN model, different values of K are tried to see which value gives the model the best performance.
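One way to try different values of K is sketched below. It assumes leave-one-out evaluation, where each example is predicted from all the others. That is only one of several possible evaluation schemes, and the data is made up. With this tiny dataset, a K that is too large reaches into the other group and every prediction comes out wrong.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Majority label among the k training examples closest to the query."""
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Predict each example from all the others and report the fraction correct."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_classify(rest, features, k) == label
    return hits / len(data)

# Made-up data: three red points and three blue points, well separated.
data = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(data, k))
```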

KNN Pros And Cons

Let’s examine some of the pros and cons of the KNN model.

Pros:

KNN can be used for both regression and classification tasks, unlike some other supervised learning algorithms.

KNN can be highly accurate and is simple to use. It’s easy to interpret, understand, and implement.

KNN doesn’t make any assumptions about the data, meaning it can be used for a wide variety of problems.

Cons:

KNN stores most or all of the data, which means that the model requires a lot of memory and is computationally expensive. Large datasets can also cause predictions to take a long time.

KNN proves to be very sensitive to the scale of the features in the dataset, and it can be thrown off by irrelevant features fairly easily in comparison to other models.
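A common way to reduce the scale problem is to rescale every feature to the same range before computing distances. The sketch below uses min-max scaling, which is one option among several. The age and income values are made up.

```python
def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant columns map to 0.0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# Made-up rows of (age in years, income in dollars). Unscaled, income would
# dominate any distance calculation simply because its numbers are larger.
rows = [(25, 40000), (35, 60000), (45, 80000)]
print(min_max_scale(rows))  # [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```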

Summing Up

K-Nearest Neighbors is one of the simplest machine learning algorithms. Despite how simple KNN is in concept, it’s also a powerful algorithm that gives fairly high accuracy on many problems. When you use KNN, be sure to experiment with various values of K in order to find the number that provides the highest accuracy.
