Natural Language Processing (NLP) is the study and application of techniques and tools that enable computers to process, analyze, interpret, and reason about human language. NLP is an interdisciplinary field and it combines techniques established in fields like linguistics and computer science. These techniques are used in concert with AI to create chatbots and digital assistants like Google Assistant and Amazon’s Alexa.
Let’s take some time to explore the rationale behind Natural Language Processing, some of the techniques used in NLP, and some common uses cases for NLP.
Why Is Natural Language Processing Important?
In order for computers to interpret human language, they must be converted into a form that a computer can manipulate. However, this isn’t as simple as converting text data into numbers. In order to derive meaning from human language, patterns have to be extracted from the hundreds or thousands of words that make up a text document. This is no easy task. There are few hard and fast rules that can be applied to the interpretation of human language. For instance, the exact same set of words can mean different things depending on the context. Human language is a complex and often ambiguous thing, and a statement can be uttered with sincerity or sarcasm.
Despite this, there are some general guidelines that can be used when interpreting words and characters, such as the character “s” being used to denote that an item is plural. These general guidelines have to be used in concert with each other to extract meaning from the text, to create features that a machine learning algorithm can interpret.
Natural Language Processing involves the application of various algorithms capable of taking unstructured data and converting it into structured data. If these algorithms are applied in the wrong manner, the computer will often fail to derive the correct meaning from the text. This can often be seen in the translation of text between languages, where the precise meaning of the sentence is often lost. While machine translation has improved substantially over the past few years, machine translation errors still occur frequently.
Natural Language Processing Techniques
Many of the techniques that are used in natural language processing can be placed in one of two categories: syntax or semantics. Syntax techniques are those that deal with the ordering of words, while semantic techniques are the techniques that involve the meaning of words.
Syntax NLP Techniques
Examples of syntax include:
- Morphological Segmentation
- Part-of-Speech Tagging
- Sentence Breaking
- Word Segmentation
Lemmatization refers to distilling the different inflections of a word down to a single form. Lemmatization takes things like tenses and plurals and simplifies them, for example, “feet” might become “foot” and “stripes” may become “stripe”. This simplified word form makes it easier for an algorithm to interpret the words in a document.
Morphological segmentation is the process of dividing words into morphemes or the base units of a word. These units are things like free morphemes (which can stand alone as words) and prefixes or suffixes.
Part-of-speech tagging is simply the process of identifying which part of speech every word in an input document is.
Parsing refers to analyzing all the words in a sentence and correlating them with their formal grammar labels or doing grammatical analysis for all the words.
Sentence breaking, or sentence boundary segmentation, refers to deciding where a sentence begins and ends.
Stemming is the process of reducing words down to the root form of the word. For instance, connected, connection, and connections would all be stemmed to “connect”.
Word Segmentation is the process of dividing large pieces of text down into small units, which can be words or stemmed/lemmatized units.
Semantic NLP Techniques
Semantic NLP techniques include techniques like:
- Named Entity Recognition
- Natural Language Generation
- Word-Sense disambiguation
Named entity recognition involves tagging certain text portions that can be placed into one of a number of different preset groups. Pre-defined categories include things like dates, cities, places, companies, and individuals.
Natural language generation is the process of using databases to transform structured data into natural language. For instance, statistics about the weather, like temperature and wind speed could be summarized with natural language.
Word-sense disambiguation is the process of assigning meaning to words within a text based on the context the words appear in.
Deep Learning Models For Natural Language Processing
Regular multilayer perceptrons are unable to handle the interpretation of sequential data, where the order of the information is important. In order to deal with the importance of order in sequential data, a type of neural network is used that preserves information from previous timesteps in the training.
Recurrent Neural Networks are types of neural networks that loop over data from previous timesteps, taking them into account when calculating the weights of the current timestep. Essentially, RNN’s have three parameters that are used during the forward training pass: a matrix based on the Previous Hidden State, a matrix based on the Current Input, and a matrix that is between the hidden state and the output. Because RNNs can take information from previous timesteps into account, they can extract relevant patterns from text data by taking earlier words in the sentence into account when interpreting the meaning of a word.
Another type of deep learning architecture used to process text data is a Long Short-Term Memory (LSTM) network. LSTM networks are similar to RNNs in structure, but owing to some differences in their architecture they tend to perform better than RNNs. They avoid a specific problem that often occurs when using RNNs called the exploding gradient problem.
These deep neural networks can be either unidirectional or bi-directional. Bi-directional networks are capable of taking not just the words that come prior to the current word into account, but the words that come after it. While this leads to higher accuracy, it is more computationally expensive.
Use Cases For Natural Language Processing
Because Natural Language Processing involves the analysis and manipulation of human languages, it has an incredibly wide range of applications. Possible applications for NLP include chatbots, digital assistants, sentiment analysis, document organization, talent recruitment, and healthcare.
Chatbots and digital assistants like Amazon’s Alexa and Google Assistant are examples of voice recognition and synthesis platforms that use NLP to interpret and respond to vocal commands. These digital assistants help people with a wide variety of tasks, letting them offload some of their cognitive tasks to another device and free up some of their brainpower for other, more important things. Instead of looking up the best route to the bank on a busy morning, we can just have our digital assistant do it.
Sentiment analysis is the use of NLP techniques to study people’s reactions and feelings to a phenomenon, as communicated by their use of language. Capturing the sentiment of a statement, like interpreting whether a review of a product is good or bad, can provide companies with substantial information regarding how their product is being received.
Automatically organizing text documents is another application of NLP. Companies like Google and Yahoo use NLP algorithms to classify email documents, putting them in the appropriate bins such as “social” or “promotions”. They also use these techniques to identify spam and prevent it from reaching your inbox.
Groups have also developed NLP techniques are being used to identify potential job hires, finding them based on relevant skills. Hiring managers are also using NLP techniques to help them sort through lists of applicants.
NLP techniques are also being used to enhance healthcare. NLP can be used to improve the detection of diseases. Health records can be analyzed and symptoms extracted by NLP algorithms, which can then be used to suggest possible diagnoses. One example of this is Amazon’s Comprehend Medical platform, which analyzes health records and extracts diseases and treatments. Healthcare applications of NLP also extend to mental health. There are apps such as WoeBot, which talks users through a variety of anxiety management techniques based in Cognitive Behavioral Therapy.
To Learn More
|Recommended Natural Language Processing Courses||Offered By||Duration||Difficulty|
Deep Learning AI
Higher School of Economics
Structured vs Unstructured Data
Unstructured data is data that isn’t organized in a pre-defined fashion or lacks a specific data model. Meanwhile, structured data is data that has clear, definable relationships between the data points, with a pre-defined model containing it. That’s the short answer on the difference between structured and unstructured data, but let’s take a closer look at the differences between the two types of data.
When it comes to computer science, data structures refer to specific ways of storing and organizing data. Different data structures possess different relationships between data points, but data can also be unstructured. What does it mean to say that data is structured? To make this definition clearer, let’s take a look at some of the various ways of structuring data.
Structured data is often held in tables such as Excel files or SQL databases. In these cases, the rows and columns of the data hold different variables or features, and it is often possible to discern the relationship between data points by checking to see where data rows and columns intersect. Structured data can easily be fit into a relational database, and examples of different features in a structured dataset can include items like names, addresses, dates, weather statistics, credit card numbers, etc. While structured data is most often text data, it is possible to store things like images and audio as structured data as well.
Common sources of structured data include things like data collected from sensors, weblogs, network data, and retail or e-commerce data. Structured data can also be generated by people filling in spreadsheets or databases with data collected from computers and other devices. For instance, data collected through online forms is often immediately fed into a data structure.
Structured data has a long history of being stored in relational databases and SQL. These storage methods are popular because of the ease of reading and writing in these formats, with most platforms and languages being able to interpret these data formats.
In a machine learning context, structured data is easier to train a machine learning system on, because the patterns within the data are more explicit. Certain features can be fed into a machine learning classifier and used to label other data instances based on those selected features. In contrast, training a machine learning system on unstructured data tends to be more difficult, for reasons that will become clear.
Unstructured data is data that isn’t organized according to a pre-defined data model or structure. Unstructured data is often called qualitative data because it can’t be analyzed or processed in traditional ways using the regular methods used for structured data.
Because unstructured data doesn’t have any defined relationships between data points, it can’t be organized in relational databases. In contrast, the way unstructured data is stored is typically with a NoSQL database, or a non-relational database. If the structure of the database is of little concern, a data lake, or a large pool of unstructured data, can be used to store the data instead of a NoSQL database.
Unstructured data is difficult to analyze, and making sense of unstructured data often involves examining individual pieces of data to discern potential features and then looking to see if those features occur in other pieces of data within the pool.
The vast majority of data is in unstructured formats, with estimates that unstructured data comprises around 80% of all data. Data mining techniques can be used to help structure data.
In terms of machine learning, certain techniques can help order unstructured data and turn it into structured data. A popular tool for turning unstructured data into structured data is a system called an autoencoder.
What is Transfer Learning?
When practicing machine learning, training a model can take a long time. Creating a model architecture from scratch, training the model, and then tweaking the model is a massive amount of time and effort. A far more efficient way to train a machine learning model is to use an architecture that has already been defined, potentially with weights that have already been calculated. This is the main idea behind transfer learning, taking a model that has already been used and repurposing it for a new task.
Before delving into the different ways that transfer learning can be used, let’s take a moment to understand why transfer learning is such a powerful and useful technique.
Solving A Deep Learning Problem
When you are attempting to solve a deep learning problem, like building an image classifier, you have to create a model architecture and then train the model on your data. Training the model classifier involves adjusting the weights of the network, a process that can take hours or even days depending on the complexity of both the model and the dataset. The training time will scale in accordance with the size of the dataset and the complexity of the model architecture.
If the model does not achieve the kind of accuracy needed for the task, tweaking of the model will likely need to be done and then the model will need to be retrained. This means more hours of training until an optimal architecture, training length, and dataset partition can be found. When you consider how many variables must be aligned with one another for a classifier to be useful, it makes sense that machine learning engineers are always looking for easier, more efficient ways to train and implement models. For this reason, the transfer learning technique was created.
After designing and testing a model, if the model proved useful, it can be saved and reused later for similar problems.
Types Of Transfer Learning
In general, there are two different kinds of transfer learning: developing a model from scratch and using a pre-trained model.
When you develop a model from scratch, you’ll need to create a model architecture capable of interpreting your training data and extracting patterns from it. After the model is trained for the first time, you’ll probably need to make changes to it in order to get the optimal performance out of the model. You can then save the model architecture and use it as a starting point for a model that will be used on a similar task.
In the second condition – the use of a pre-trained model – you merely have to select a pre-trained model to utilize. Many universities and research teams will make the specifications of their model available for general use. The architecture of the model can be downloaded along with the weights.
When conducting transfer learning, the entire model architecture and weights can be used for the task at hand, or just certain portions/layers of the model can be used. Using only some of the pre-trained model and training the rest of the model is referred to as fine-tuning.
Finetuning a network describes the process of training just some of the layers in a network. If a new training dataset is much like the dataset used to train the original model, many of the same weights can be used.
The number of layers in the network that should be unfrozen and retrained should scale in accordance with the size of the new dataset. If the dataset that is being trained on is small, it is a better practice to hold the majority of the layers as they are and train just the final few layers. This is to prevent the network from overfitting. Alternatively, the final layers of the pre-trained network can be removed and new layers are added, which are then trained. In contrast, if the dataset is a large dataset, potentially larger than the original dataset, the entire network should be retrained. To use the network as a fixed feature extractor, the majority of the network can be used to extract the features while just the final layer of the network can be unfrozen and trained.
When you are finetuning a network, just remember that the earlier layers of the ConvNet are what contain the information representing the more generic features of the images. These are features like edges and colors. In contrast, the ConvNet’s later layers hold the details that are more specific to the individual classes held within the dataset that the model was initially trained on. If you are training a model on a dataset that is quite different from the original dataset, you’ll probably want to use the initial layers of the model to extract features and just retrain the rest of the model.
Transfer Learning Examples
The most common applications of transfer learning are probably those that use image data as inputs. These are often prediction/classification tasks. The way Convolutional Neural Networks interpret image data lends itself to reusing aspects of models, as the convolutional layers often distinguish very similar features. One example of a common transfer learning problem is the ImageNet 1000 task, a massive dataset full of 1000 different classes of objects. Companies who develop models that achieve high performance on this dataset often release their models under licenses that let others reuse them. Some of the models that have resulted from this process include the Microsoft ResNet model, the Google Inception Model, and the Oxford VGG Model group.
To Learn More
|Recommended Artificial Intelligence Courses||Offered By||Duration||Difficulty|
University of Washington
What is a Decision Tree?
A decision tree is a useful machine learning algorithm used for both regression and classification tasks. The name “decision tree” comes from the fact that the algorithm keeps dividing the dataset down into smaller and smaller portions until the data has been divided into single instances, which are then classified. If you were to visualize the results of the algorithm, the way the categories are divided would resemble a tree and many leaves.
That’s a quick definition of a decision tree, but let’s take a deep dive into how decision trees work. Having a better understanding of how decision trees operate, as well as their use cases, will assist you in knowing when to utilize them during your machine learning projects.
General Format of a Decision Tree
A decision tree is a lot like a flowchart. To utilize a flowchart you start at the starting point, or root, of the chart and then based on how you answer the filtering criteria of that starting node you move to one of the next possible nodes. This process is repeated until an ending is reached.
Decision trees operate in essentially the same manner, with every internal node in the tree being some sort of test/filtering criteria. The nodes on the outside, the endpoints of the tree, are the labels for the datapoint in question and they are dubbed “leaves”. The branches that lead from the internal nodes to the next node are features or conjunctions of features. The rules used to classify the datapoints are the paths that run from the root to the leaves.
Steps and Algorithms
Decision trees operate on an algorithmic approach which splits the dataset up into individual data points based on different criteria. These splits are done with different variables, or the different features of the dataset. For example, if the goal is to determine whether or not a dog or cat is being described by the input features, variables the data is split on might be things like “claws” and “barks”.
So what algorithms are used to actually split the data into branches and leaves? There are various methods that can be used to split a tree up, but the most common method of splitting is probably a technique referred to as “recursive binary split”. When carrying out this method of splitting, the process starts at the root and the number of features in the dataset represents the possible number of possible splits. A function is used to determine how much accuracy every possible split will cost, and the split is made using the criteria that sacrifices the least accuracy. This process is carried out recursively and sub-groups are formed using the same general strategy.
In order to determine the cost of the split, a cost function is used. A different cost function is used for regression tasks and classification tasks. The goal of both cost functions is to determine which branches have the most similar response values, or the most homogenous branches. Consider that you want test data of a certain class to follow certain paths and this makes intuitive sense.
In terms of the regression cost function for recursive binary split, the algorithm used to calculate the cost is as follows:
sum(y – prediction)^2
The prediction for a particular group of data points is the mean of the responses of the training data for that group. All the data points are run through the cost function to determine the cost for all the possible splits and the split with the lowest cost is selected.
Regarding the cost function for classification, the function is as follows:
G = sum(pk * (1 – pk))
This is the Gini score, and it is a measurement of the effectiveness of a split, based on how many instances of different classes are in the groups resulting from the split. In other words, it quantifies how mixed the groups are after the split. An optimal split is when all the groups resulting from the split consist only of inputs from one class. If an optimal split has been created the “pk” value will be either 0 or 1 and G will be equal to zero. You might be able to guess that the worst-case split is one where there is a 50-50 representation of the classes in the split, in the case of binary classification. In this case, the “pk” value would be 0.5 and G would also be 0.5.
The splitting process is terminated when all the data points have been turned into leaves and classified. However, you may want to stop the growth of the tree early. Large complex trees are prone to overfitting, but several different methods can be used to combat this. One method of reducing overfitting is to specify a minimum number of data points that will be used to create a leaf. Another method of controlling for overfitting is restricting the tree to a certain maximum depth, which controls how long a path can stretch from the root to a leaf.
Another process involved in the creation of decision trees is pruning. Pruning can help increase the performance of a decision tree by stripping out branches containing features that have little predictive power/little importance for the model. In this way, the complexity of the tree is reduced, it becomes less likely to overfit, and the predictive utility of the model is increased.
When conducting pruning, the process can start at either the top of the tree or the bottom of the tree. However, the easiest method of pruning is to start with the leaves and attempt to drop the node that contains the most common class within that leaf. If the accuracy of the model doesn’t deteriorate when this is done, then the change is preserved. There are other techniques used to carry out pruning, but the method described above – reduced error pruning – is probably the most common method of decision tree pruning.
Considerations For Using Decision Trees
Decision trees are often useful when classification needs to be carried out but computation time is a major constraint. Decision trees can make it clear which features in the chosen datasets wield the most predictive power. Furthermore, unlike many machine learning algorithms where the rules used to classify the data may be hard to interpret, decision trees can render interpretable rules. Decision trees are also able to make use of both categorical and continuous variables which means that less preprocessing is needed, compared to algorithms that can only handle one of these variable types.
Decision trees tend not to perform very well when used to determine the values of continuous attributes. Another limitation of decision trees is that, when doing classification, if there are few training examples but many classes the decision tree tends to be inaccurate.
To Learn More
|Recommended Artificial Intelligence Courses||Offered By||Duration||Difficulty|
University of Washington