What Is Synthetic Data?

Synthetic data is a quickly expanding trend and emerging tool in the field of data science. What is synthetic data exactly? The short answer is that synthetic data is data that isn’t gathered from real-world phenomena or events; rather, it’s generated by a computer program. Yet why is synthetic data becoming so important for data science? How is synthetic data created? Let’s explore the answers to these questions.

What is a Synthetic Dataset?

As the term “synthetic” suggests, synthetic datasets are generated through computer programs, instead of being composed through the documentation of real-world events. The primary purpose of a synthetic dataset is to be versatile and robust enough to be useful for the training of machine learning models.

In order to be useful for a machine learning classifier, the synthetic data should have certain properties. The data can be categorical, binary, or numerical, and it should be possible to generate a dataset of arbitrary length. The random processes used to generate the data should be controllable and based on various statistical distributions. Random noise may also be added to the dataset.

If the synthetic data is being used for a classification algorithm, the amount of class separation should be customizable, so that the classification problem can be made easier or harder according to the problem’s requirements. Meanwhile, for a regression task, non-linear generative processes can be employed to generate the data.
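
As a minimal sketch of these ideas, scikit-learn’s dataset generators expose exactly these knobs; the class_sep and noise parameters below control class separation and target noise (the specific values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification, make_regression

# Synthetic classification data: 1,000 samples, 20 features,
# with class_sep controlling how easy the problem is.
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_classes=3, class_sep=0.8, random_state=42,
)

# Synthetic regression data with Gaussian noise added to the targets.
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, noise=10.0, random_state=42,
)
```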

Why Use Synthetic Data?

As machine learning frameworks like TensorFlow and PyTorch become easier to use, and as pre-designed models for computer vision and natural language processing become more ubiquitous and powerful, the primary problem that data scientists must face is the collection and handling of data. Companies often have difficulty acquiring large amounts of data to train an accurate model within a given time frame, and hand-labeling data is a costly, slow way to acquire it. Generating and using synthetic data can help data scientists and companies overcome these hurdles and develop reliable machine learning models more quickly.

There are a number of advantages to using synthetic data. The most obvious way that the use of synthetic data benefits data science is that it reduces the need to capture data from real-world events, which makes it possible to construct a dataset much more quickly than one dependent on real-world events. Large volumes of data can thus be produced in a short timeframe. This is especially valuable for events that rarely occur in the wild, as additional data can be generated from a handful of genuine samples. Beyond that, the data can be automatically labeled as it is generated, drastically reducing the amount of time needed to label data.

Synthetic data can also be useful to gain training data for edge cases, which are instances that may occur infrequently but are critical for the success of your AI. Edge cases are events that are very similar to the primary target of an AI but differ in important ways. For instance, objects that are only partially in view could be considered edge cases when designing an image classifier.

Finally, synthetic datasets can minimize privacy concerns. Attempts to anonymize data can be ineffective, as even if sensitive/identifying variables are removed from the dataset, other variables can act as identifiers when they are combined. This isn’t an issue with synthetic data, as it was never based on a real person, or real event, in the first place.

Use Cases for Synthetic Data

Synthetic data has a wide variety of uses, as it can be applied to just about any machine learning task. Common use cases for synthetic data include self-driving vehicles, security, robotics, fraud protection, and healthcare.

One of the initial use cases for synthetic data was self-driving cars, where synthetic data is used to create training data for conditions in which getting real, on-the-road data is difficult or dangerous. Synthetic data is also useful for training image recognition systems, like surveillance systems, much more efficiently than manually collecting and labeling training data. Robotics systems can be slow to train and develop with traditional data collection and training methods; synthetic data allows robotics companies to test and engineer systems through simulations. Fraud protection systems also benefit, since new fraud detection methods can be trained and tested with a constant supply of fresh synthetic data. In the healthcare field, synthetic data can be used to design health classifiers that are accurate yet preserve people’s privacy, as the data isn’t based on real people.

Synthetic Data Challenges

While the use of synthetic data brings many advantages with it, it also brings many challenges.

When synthetic data is created, it often lacks outliers. Outliers occur in data naturally, and while they are often dropped from training datasets, their existence may be necessary to train truly reliable machine learning models. Beyond this, the quality of synthetic data can be highly variable. Synthetic data is often generated from input, or seed, data, and therefore its quality can depend on the quality of that input data. If the data used to generate the synthetic data is biased, the generated data can perpetuate that bias. Synthetic data also requires some form of output/quality control: it needs to be checked against human-annotated data or some other form of authentic data.

How Is Synthetic Data Created?

Synthetic data is created programmatically with machine learning techniques. Classical machine learning techniques like decision trees can be used, as can deep learning techniques. The requirements for the synthetic data will influence what type of algorithm is used to generate it. Decision trees and similar machine learning models let companies create non-classical, multi-modal data distributions, trained on examples of real-world data. Generating data with these algorithms will produce data that is highly correlated with the original training data. For instances where the typical distribution of the data is known, a company can generate synthetic data through use of a Monte Carlo method.
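
As a simple illustration of the Monte Carlo idea (a hypothetical example, not any particular company’s pipeline), once the distributions of the variables are known, you can simply sample from them:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Suppose analysis of real data suggests transaction amounts are roughly
# log-normal and customer ages roughly normal (assumed parameters,
# purely for illustration).
amounts = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
ages = rng.normal(loc=40, scale=12, size=10_000).clip(18, 90)

# Each row is one synthetic record sampled from the fitted distributions.
synthetic_records = np.column_stack([amounts, ages])
```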

Deep learning-based methods of generating synthetic data typically make use of either a variational autoencoder (VAE) or a generative adversarial network (GAN). VAEs are unsupervised machine learning models that make use of encoders and decoders. The encoder portion of a VAE is responsible for compressing the data down into a simpler, compact representation of the original dataset, which the decoder then uses to generate a reconstruction of the base data. A VAE is trained with the goal that the input data and the output data are extremely similar.
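
A minimal PyTorch sketch of this encoder/decoder structure (dimensions and architecture are illustrative assumptions, not a production model):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder compresses the input to the parameters of a latent Gaussian.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        # Decoder reconstructs the input from a latent sample.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample the latent while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# Training minimizes reconstruction error plus a KL term that regularizes
# the latent space; sampling z afterwards yields new synthetic data.
```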

GAN models are called “adversarial” networks because they are actually two networks that compete with each other. The generator is responsible for generating synthetic data, while the second network (the discriminator) compares the generated data with a real dataset and tries to determine which data is fake. When the discriminator catches fake data, the generator is notified of this and makes changes to try to get a new batch of data past the discriminator. In turn, the discriminator becomes better and better at detecting fakes. The two networks are trained against each other, with fakes becoming more lifelike all the time.
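
The adversarial loop can be sketched in a few lines of PyTorch (again an illustrative toy, with made-up layer sizes):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    fake = G(torch.randn(n, latent_dim))

    # Discriminator: learn to label real data 1 and generated data 0.
    opt_d.zero_grad()
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + \
             bce(D(fake.detach()), torch.zeros(n, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: try to get its fakes classified as real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(n, 1))
    g_loss.backward()
    opt_g.step()
```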

How Does Image Classification Work?

How can your phone determine what an object is just by taking a photo of it? How do social media websites automatically tag people in photos? This is accomplished through AI-powered image recognition and classification.

The recognition and classification of images is what enables many of the most impressive accomplishments of artificial intelligence. Yet how do computers learn to detect and classify images? In this article, we’ll cover the general methods that computers use to interpret and detect images and then take a look at some of the most popular methods of classifying those images.

Pixel-Level vs. Object-Based Classification

Image classification techniques can mainly be divided into two different categories: pixel-based classification and object-based classification.

Pixels are the base units of an image, and the analysis of pixels is the primary way that image classification is done. However, classification algorithms can either use just the spectral information within individual pixels to classify an image or examine spatial information (nearby pixels) along with the spectral information. Pixel-based classification methods utilize only spectral information (the intensity of a pixel), while object-based classification methods take into account both pixel spectral information and spatial information.

There are different classification techniques used for pixel-based classification. These include minimum-distance-to-mean, maximum-likelihood, and minimum-Mahalanobis-distance. These methods require that the means and variances of the classes are known, and they all operate by examining the “distance” between class means and the target pixels.
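
A minimum-distance-to-mean classifier is simple enough to sketch directly (a toy NumPy version, assuming the class means have already been estimated from training pixels):

```python
import numpy as np

def classify_pixels(pixels, class_means):
    """Assign each pixel to the class with the nearest spectral mean.

    pixels: (n_pixels, n_bands) array of spectral values.
    class_means: (n_classes, n_bands) array of per-class mean spectra.
    """
    # Euclidean distance from every pixel to every class mean.
    dists = np.linalg.norm(pixels[:, None, :] - class_means[None, :, :], axis=2)
    return dists.argmin(axis=1)  # index of the closest class per pixel

means = np.array([[30.0, 40.0], [200.0, 180.0]])   # e.g. "water" vs. "sand"
pixels = np.array([[35.0, 42.0], [190.0, 175.0]])
print(classify_pixels(pixels, means))  # -> [0 1]
```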

Pixel-based classification methods are limited by the fact that they can’t use information from other nearby pixels. In contrast, object-based classification methods can include other pixels and therefore they also use spatial information to classify items. Note that “object” just refers to contiguous regions of pixels and not whether or not there is a target object within that region of pixels.

Preprocessing Image Data For Object Detection

The most recent and reliable image classification systems primarily use object-level classification schemes, and for these approaches image data must be prepared in specific ways. The objects/regions need to be selected and preprocessed.

Before an image, and the objects/regions within that image, can be classified, the data that comprises that image has to be interpreted by the computer. Images need to be preprocessed and readied for input into the classification algorithm, and this is done through object detection. This is a critical part of readying the data and preparing the images to train the machine learning classifier.

Object detection is done with a variety of methods and techniques. To begin with, whether there is a single object of interest or multiple objects impacts how the image preprocessing is handled. If there is just one object of interest, the image undergoes image localization. The pixels that comprise the image have numerical values that are interpreted by the computer and used to display the proper colors and hues. A construct known as a bounding box is drawn around the object of interest, which tells the computer what part of the image is important and what pixel values define the object. If there are multiple objects of interest in the image, a technique called object detection is used to apply these bounding boxes to all the objects within the image.

Photo: Adrian Rosebrock via Wikimedia Commons, CC BY SA 4.0 (https://commons.wikimedia.org/wiki/File:Intersection_over_Union_-_object_detection_bounding_boxes.jpg)
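
Predicted bounding boxes are commonly scored against ground-truth boxes using Intersection over Union (IoU), as in the figure above. A small sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the overlapping rectangle (if any).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```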

Another method of preprocessing is image segmentation. Image segmentation functions by dividing the whole image into segments based on similar features. Different regions of the image will have similar pixel values in comparison to other regions of the image, so these pixels are grouped together into image masks that correspond to the shape and boundaries of the relevant objects within the image. Image segmentation helps the computer isolate the features of the image that will help it classify an object, much like bounding boxes do, but they provide much more accurate, pixel-level labels.

After the object detection or image segmentation has been completed, labels are applied to the regions in question. These labels are fed, along with the values of the pixels comprising the object, into the machine learning algorithms that will learn patterns associated with the different labels.

Machine Learning Algorithms

Once the data has been prepared and labeled, the data is fed into a machine learning algorithm, which trains on the data. We’ll cover some of the most common kinds of machine learning image classification algorithms below.

K-Nearest Neighbors

K-Nearest Neighbors is a classification algorithm that examines the closest training examples and looks at their labels to ascertain the most probable label for a given test example. When it comes to image classification using KNN, the feature vectors and labels of the training images are stored and just the feature vector is passed into the algorithm during testing. The training and testing feature vectors are then compared against each other for similarity.

KNN-based classification algorithms are extremely simple and they deal with multiple classes quite easily. However, KNN calculates similarity based on all features equally. This means that it can be prone to misclassification when provided with images where only a subset of the features is important for the classification of the image.
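
A minimal scikit-learn sketch of KNN image classification, treating flattened pixel values as the feature vectors (real pipelines often use richer features):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 8x8 digit images, flattened into 64-dimensional feature vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on held-out images
```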

Support Vector Machines

Support Vector Machines are a classification method that places points in space and then draws dividing lines between the points, placing objects in different classes depending on which side of the dividing plane the points fall on. Support Vector Machines are capable of doing nonlinear classification through the use of a technique known as the kernel trick. While SVM classifiers are often very accurate, a substantial drawback to SVM classifiers is that they tend to be limited by both size and speed, with speed suffering as size increases.
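
The same digits task with an SVM and a radial basis function kernel (the kernel trick in practice), as a minimal sketch:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel lets the SVM draw non-linear decision boundaries.
svm = SVC(kernel="rbf", gamma="scale", C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```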

Multi-Layer Perceptrons (Neural Nets)

Multi-layer perceptrons, also called neural network models, are machine learning algorithms inspired by the human brain. Multi-layer perceptrons are composed of multiple layers that are joined together, much like neurons in the human brain are linked together. Neural networks make assumptions about how the input features are related to the data’s classes, and these assumptions are adjusted over the course of training. Simple neural network models like the multi-layer perceptron are capable of learning non-linear relationships, and as a result, they can be much more accurate than other models. However, MLP models suffer from some notable issues, like the presence of non-convex loss functions.
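
And the corresponding scikit-learn sketch (the hidden layer sizes are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of 128 and 64 units, trained with backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```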

Deep Learning Algorithms (CNNs)

Photo: APhex34 via Wikimedia Commons, CC BY SA 4.0 (https://commons.wikimedia.org/wiki/File:Typical_cnn.png)

The most commonly used image classification algorithm in recent times is the Convolutional Neural Network (CNN). CNNs are customized versions of neural networks that combine multilayer neural networks with specialized layers that are capable of extracting the features most important and relevant to the classification of an object. CNNs can automatically discover, generate, and learn features of images. This greatly reduces the need to manually label and segment images to prepare them for machine learning algorithms. They also have an advantage over MLP networks because they can deal with non-convex loss functions.

Convolutional Neural Networks get their name from the fact that they create “convolutions”. CNNs operate by taking a filter and sliding it over an image. You can think of this as viewing sections of a landscape through a moveable window, concentrating on just the features that are viewable through the window at any one time. The filter contains numerical values which are multiplied with the values of the pixels themselves. The result is a new frame, or matrix, full of numbers that represent the original image. This process is repeated for a chosen number of filters, and then the frames are joined together into a new image that is slightly smaller and less complex than the original image. A technique called pooling is used to select just the most important values within the image, and the goal is for the convolutional layers to eventually extract just the most salient parts of the image that will help the neural network recognize the objects in the image.
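
The sliding-filter operation itself is easy to sketch in NumPy (a naive, illustrative implementation; deep learning libraries use far faster equivalents):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over an image and record the response at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the filter values and sum the result.
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.random.rand(8, 8)
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)  # crude vertical-edge detector
print(convolve2d(image, edge_filter).shape)  # (6, 6): slightly smaller, as described
```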

Convolutional Neural Networks are comprised of two different parts. The convolutional layers are what extract the features of the image and convert them into a format that the neural network layers can interpret and learn from. The early convolutional layers are responsible for extracting the most basic elements of the image, like simple lines and boundaries. The middle convolutional layers begin to capture more complex shapes, like simple curves and corners. The later, deeper convolutional layers extract the high-level features of the image, which are what is passed into the neural network portion of the CNN, and are what the classifier learns.

Do you recommend Recommendation Engines?

In business, the needle in a haystack problem is a constant challenge. Recommendation Engines are here to help tackle that challenge. 

In e-commerce and retail, you offer hundreds or thousands of products. Which is the right product for your customers?

In sales and marketing, you have a large number of prospects in your pipeline. Yet, you only have so many hours in the day. So, you face the challenge of deciding where precisely to focus your effort.

There is a specialized technology, powered by AI and Big Data, that makes these challenges much easier to manage: recommendation engines.

What are recommender systems?

In its simplest terms, a recommendation engine sorts through many items and predicts the selection most relevant to the user. For consumers, Amazon’s product recommendation engine is a familiar example. In the entertainment world, Netflix has worked hard to develop their engine. Netflix’s recommendation engine has delivered bottom-line benefits:

“[Netflix’s] sophisticated recommendation system and personalized user experience, it has allowed them to save $1 billion per year from service cancellations.” – The ROI of recommendation engines for marketing

From the end user’s perspective, it is often unclear how recommendation engines work. We’re going to pull the curtain back and explain how they work, starting with the key ingredient: data.

Recommendation Engines: What data do they use?

The data you need for a recommendation engine depends on your goal. Assume your goal is to increase sales in an e-commerce company. In that case, the bare minimum required data would fall into two categories: a product database and end-user behavior. To illustrate how this works, look at this simple example.

  • Company: USB Accessories, Inc. The company specializes in selling USB accessories and products like cables, thumb drives, and hubs to consumers and businesses.
  • Product Data. To keep the initial recommendation engine simple, the company limits it to 100 products.
  • User Data. In the case of an online store, user data will include website analytics information, email marketing, and other sources. For instance, you may find that 50% of customers who buy an external hard drive also buy USB cables.
  • Recommendation Output. In this case, your recommendation engine may generate a recommendation (or a discount code) to hard drive buyers to encourage them to buy USB cables.

In practice, the best recommendation engines use much more data. As a general rule, recommendation engines produce better business results when they have a large volume of data to use.
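
As a toy illustration of mining that kind of co-purchase signal (hypothetical data, not any particular store’s), a few lines of pandas suffice:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c", "d", "d"],
    "product":  ["hard drive", "usb cable", "hard drive", "usb cable",
                 "hub", "hard drive", "thumb drive"],
})

# Which customers bought a hard drive, and of those, who also bought a cable?
baskets = orders.groupby("customer")["product"].apply(set)
drive_buyers = baskets[baskets.apply(lambda p: "hard drive" in p)]
also_cable = drive_buyers.apply(lambda p: "usb cable" in p).mean()
print(f"{also_cable:.0%} of hard-drive buyers also bought a USB cable")
```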

How do recommendation engines use your data?

Many recommendation engines use a handful of techniques to process your data.

Content-based filtering

This type of recommendation algorithm looks at the attributes of items a user has shown a preference for and attempts to recommend similar items. In this case, the engine is focused on the product and on highlighting related items. This type of recommendation engine is relatively simple to build, and it is a good starting point for companies with limited data.

Collaborative filtering

Have you asked somebody else for a recommendation before making a purchase? Or considered online reviews in your buying process? If so, you have experienced collaborative filtering. More advanced recommendation engines analyze user reviews, ratings, and other user-generated content to produce relevant suggestions. This type of recommendation engine strategy is powerful because it leverages social proof.

Hybrid recommenders

Hybrid recommendation engines combine two or more recommendation methods to produce better results. Returning to the e-commerce example outlined above, let’s say you have acquired user reviews and ratings (e.g., 1 to 5 stars) over the past year. Now, you can use both content-based filtering and collaborative filtering to present recommendations. Combining multiple recommendation engines or algorithms successfully usually takes experimentation. For that reason, it is best considered a relatively advanced strategy.
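
A compact sketch of the hybrid idea (toy data and an arbitrary 50/50 blend, purely illustrative): score items by their content similarity to something the user liked, score them by the ratings of the most similar other user, and average the two.

```python
import numpy as np

# Content-based side: item feature vectors (e.g. category/keyword indicators).
item_features = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]], dtype=float)

# Collaborative side: user-item rating matrix (0 = unrated).
ratings = np.array([[5, 4, 0], [4, 0, 2], [0, 5, 1]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def hybrid_scores(user, liked_item, alpha=0.5):
    # Content score: similarity of every item to one the user liked.
    content = np.array([cosine(item_features[liked_item], f) for f in item_features])
    # Collaborative score: normalized ratings of the most similar other user.
    sims = [cosine(ratings[user], ratings[u]) if u != user else -1.0
            for u in range(len(ratings))]
    neighbor = int(np.argmax(sims))
    collab = ratings[neighbor] / 5.0
    return alpha * content + (1 - alpha) * collab

print(hybrid_scores(user=0, liked_item=0))  # blended score for every item
```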

A recommendation engine is only successful if you feed it high-quality data. It also cannot perform effectively if your company database contains errors or out-of-date information. That’s why you need to invest resources in data quality continuously.

Case Studies: 

Hiring Automated: Candidate Scoring

There are more than 50 applicants on average per job posting, according to Jobvite research. For human resources departments and managers, that applicant volume creates a tremendous amount of work. To simplify the process, Blue Orange implemented a recommendation engine for a Fortune 500 hedge fund. This HR automation project helped the company rank candidates in a standardized way. Using ten years’ worth of applicant data and resumes, the firm now has a sophisticated scoring model to find good-fit candidates.

A hedge fund in New York City needed to parse inconsistent resumes, requiring OCR, to improve its hiring process. Even the best OCR parsing leaves you with messy and unstructured data. Then, as a candidate moves through the application process, humans get involved, adding free-form text reviews of the applicant along with both linguistic and personal biases. In addition, each data source is siloed, providing limited analytical opportunity.

Approach: After assessing multiple companies’ hiring processes, we have found three consistent opportunities to systematically improve hiring outcomes using NLP machine learning. The problem areas are: correctly structuring candidate resume data, assessing job fit, and reducing human hiring bias. With a cleaned and structured dataset, we were able to perform both sentiment analysis on the text and subjectivity detection to reduce candidate bias in human assessment.

Results: Using keyword detection classifiers, optical character recognition, and cloud-based NLP engines, we were able to scrub string text and turn it into relational data. With structured data, we provided a fast, interactive, and searchable business analytics dashboard in AWS QuickSight.

E-Commerce: Zageno Medical Supplies

Another example of recommendation engines being implemented in the real world comes from Zageno. Zageno is an e-commerce company that does for lab scientists what Amazon does for the rest of us. The caveat is that the needs of lab scientists are exacting, so the supplies procured for their research must be as well. The quotes below are from our interview with Zageno and highlight how the company uses recommendation engines to deliver the most accurate supplies to lab scientists.

Q&A: Blue Orange Digital interviews Zageno

Question:
How has your company used a recommendation engine and what sort of results did you see?

Answer:

There are two examples of the recommendation engines that ZAGENO employs for its scientific customers. To explain these we felt it best to bullet point them.

  • ZAGENO’s Scientific Score:
    • ZAGENO’s Scientific Score is a comprehensive product rating system, specifically developed for evaluating research products. It incorporates several aspects of product data, from multiple sources, to equip scientists with a sophisticated and unbiased product rating for making accurate purchasing decisions.
    • We apply sophisticated machine-learning algorithms to accurately match, group, and categorize millions of products. The Scientific Score accounts for these categorizations, as each product’s score is calculated relative to those in the same category. The result is a rating system that scientists can trust — one that is specific to both product application and product type.
    • Standard product ratings are useful to assess products quickly, but are often biased and unreliable, due to their reliance on unknown reviews or a single metric (e.g. publications). They also provide little detail on experimental context or application. The Scientific Score utilizes a scientific methodology to objectively and comprehensively evaluate research products. It combines all necessary and relevant product information into a single 0—10 rating to support our customers in deciding which product to buy and use for their application — saving hours of product research.
    • To ensure no single factor dominates, we add cut-off points and give more weight to recent contributions. The sheer number of factors we take into account virtually eliminates any opportunity for manipulation. As a result, our score is an objective measure of the quality and quantity of available product information, which supports our customers’ purchasing decisions.
  • Alternative Products:
    • Alternative products are defined by the same values for key attributes; key attributes are defined for each category to account for specific product characteristics.
    • We are working on increasing the underlying data and attributes and improving the algorithm to improve the suggestions.
    • Alternative product suggestions are intended to help both scientists and procurement consider and evaluate potential products they might not have considered or known otherwise.
    • Alternative products are defined solely by product characteristics, independent of supplier, brand, or other commercial data.

Do you recommend recommendation systems? 

“Yes, but make sure you are using the right data to base your recommendation on both the quality and quantity reflecting true user expectations. Create transparency because nobody, particularly scientists, will trust or rely on a black box. Share with your users which information is used, how it is weighted, and keep on learning so as to continually improve. Finally, complete the cycle by taking the user feedback that you’ve collected and bring it back into the system.” – Zageno

The power of recommendation engines has never been greater. As shown by giants like Amazon and Netflix, recommenders can be directly responsible for increases in revenue and customer retention. Companies such as Zageno show that you do not need to be a massive company to leverage the power of recommenders. The benefits of recommendation engines span many industries, from e-commerce to human resources.

The Fast Way To Bring Recommendation Engines To Your Company

Developing a recommendation engine takes data expertise. Your internal IT team may not have the capacity to build this out. If you want to get the customer retention and efficiency benefits of recommendation engines, you don’t have to wait for IT to become less busy. Drop us a line and let us know. The Blue Orange Digital data science team is happy to make recommenders work for your benefit too!

How Does Text Classification Work?

Text classification is the process of analyzing text sequences and assigning them a label, putting them in a group based on their content. Text classification underlies almost any AI or machine learning task involving Natural Language Processing (NLP). With text classification, a computer program can carry out a wide variety of different tasks like spam recognition, sentiment analysis, and chatbot functions. How does text classification work exactly? What are the different methods of carrying out text classification? We’ll explore the answers to these questions below.

Defining Text Classification

Text classification is one of those terms that is applied to many different tasks and algorithms, so it’s useful to make sure we understand the basic concept before moving on to explore the different ways it can be carried out.

Anything that involves creating different categories for text, and then labeling different text samples as these categories, can be considered text classification. As long as a system carries out these basic steps it can be considered a text classifier, regardless of the exact method used to classify the text and regardless of how the text classifier is eventually applied. Detecting email spam, organizing documents by topic or title, and recognizing the sentiment of a review for a product are all examples of text classification because they are accomplished by taking text as an input and outputting a class label for that piece of text.

How Does Text Classification Work?

Photo: Quinn Dombrowski via Flickr, CC BY SA 2.0 (https://www.flickr.com/photos/quinnanya/4714794045)

Most text classification methods can be placed into one of two categories: rule-based methods and machine learning methods.

Rule-Based Classification Methods

Rule-based text classification methods operate through the use of explicitly engineered linguistic rules. The system uses the rules created by the engineer to determine which class a given piece of text should belong to, looking for clues in the form of semantically relevant text elements. Every rule has a pattern that the text must match to be placed into the corresponding category.

To be more concrete, let’s say you wanted to design a text classifier capable of distinguishing common topics of conversation, like the weather, movies, or food. In order to enable your text classifier to recognize discussion of the weather, you might tell it to look for weather-related words in the body of the text samples it is being fed. You’d have a list of keywords, phrases, and other relevant patterns that could be used to distinguish the topic. For instance, you might instruct the classifier to look for words like “wind”, “rain”, “sun”, “snow”, or “cloud”. You could then have the classifier count the number of times these words appear in the input text, and if they appear more often than words related to movies, classify the text as belonging to the weather class.
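
A toy version of such a rule-based classifier (hypothetical keyword lists, purely for illustration):

```python
WEATHER_WORDS = {"wind", "rain", "sun", "snow", "cloud"}
MOVIE_WORDS = {"film", "actor", "director", "scene", "trailer"}

def classify(text):
    words = text.lower().split()
    # Count keyword hits per topic and pick the topic with more matches.
    weather_hits = sum(w in WEATHER_WORDS for w in words)
    movie_hits = sum(w in MOVIE_WORDS for w in words)
    return "weather" if weather_hits >= movie_hits else "movies"

print(classify("the rain and wind ruined the outdoor screening"))  # -> weather
```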

The advantage of rule-based systems is that their inputs and outputs are predictable and interpretable by humans, and they can be improved through manual intervention by the engineer. However, rule-based classification methods are also somewhat brittle, and they often have a difficult time generalizing, because they can only adhere to the predefined patterns that have been programmed in. As an example, the word “cloud” could refer to moisture in the sky, or it could refer to a digital cloud where data is stored. It’s difficult for rule-based systems to handle these nuances without the engineers spending a fair amount of time trying to manually anticipate and adjust for them.

Machine Learning Systems

As mentioned above, rule-based systems have limitations, as their functions and rules must be pre-programmed. By contrast, machine learning-based classification systems operate by applying algorithms that analyze datasets for patterns that are associated with a particular class.

Machine learning algorithms are fed pre-labeled/pre-classified instances that are analyzed for relevant features. These pre-labeled instances are the training data.

The machine learning classifier analyzes the training data and learns patterns that are associated with the different classes. After this, unseen instances are stripped of their labels and fed to the classification algorithm which assigns the instances a label. The assigned labels are then compared to the original labels to see how accurate the machine learning classifier was, gauging how well the model learned what patterns predict which classes.

Machine learning algorithms operate by analyzing numerical data. This means that in order to use a machine learning algorithm on text data, the text needs to be converted into a numerical format. There are various methods of encoding text data as numerical data and creating machine learning methods around this data. We’ll cover some of the different ways to represent text data below.

Bag-of-Words

Bag-of-words is one of the most commonly used approaches for encoding and representing text data. The term “bag-of-words” comes from the fact that you essentially take all the words in the documents and put them into one “bag” without paying attention to word order or grammar, paying attention only to the frequency of words in the bag. This results in a long array, or vector, containing a single representation of all the words in the input documents. So if there are 10,000 unique words in the input documents, the feature vectors will be 10,000 entries long. This is how the size of the word bag/feature vector is calculated.

Photo: gk_ via Machinelearning.co, (https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6)

After the feature vector size has been determined, every document in the list of total documents is assigned its own vector filled with numbers that indicate how many times the word in question appears in the current document. This means that if the word “food” appears eight times within one text document, that corresponding feature vector/feature array will have an eight in the corresponding position.

Put another way, all the unique words that appear in the input documents are all piled into one bag and then each document gets a word vector of the same size, which is then filled in with the number of times the different words appear in the document.

Text datasets will often contain a large number of unique words, but most of them aren’t used very frequently. For this reason, the number of words used to create the word vector is typically capped at a chosen value (N) and then the feature vector dimension will be Nx1.
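
scikit-learn’s CountVectorizer implements exactly this scheme, including the cap on vocabulary size (a minimal sketch with two toy documents):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the food was great and the service was great",
    "the film had a great director",
]

# max_features caps the vocabulary at the N most frequent words.
vectorizer = CountVectorizer(max_features=10)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per document, one column per word count
```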

Term Frequency-Inverse Document Frequency (TF-IDF)

Another way to represent a document based on the words in it is dubbed Term Frequency-Inverse Document Frequency (TF-IDF). A TF-IDF approach also creates a vector that represents the document based on the words in it, but unlike bag-of-words, these words are weighted by more than just their frequency. TF-IDF considers the importance of the words in the documents, attempting to quantify how relevant each word is to the subject of the document. In other words, TF-IDF analyzes relevance instead of frequency, and the word counts in a feature vector are replaced by TF-IDF scores calculated with regard to the whole dataset.

A TF-IDF approach operates by first calculating the term frequency, the number of times that the unique terms appear within a specific document. However, TF-IDF also takes care to limit the influence that extremely common words like “the”, “or”, and “and” have, as these “stopwords” are very common yet convey very little information about the content of the document. These words need to be discounted, which is what the “inverse document frequency” part of TF-IDF refers to. This is done because the more documents a specific word shows up in, the less useful that word is for distinguishing any one document from the others in the collection. The formula that TF-IDF uses to calculate the importance of a word is designed to preserve the words that are the most frequent and the most semantically rich.

The feature vectors created by the TF-IDF approach contain normalized values, assigning each word a weighted value as calculated by the TF-IDF formula.
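
The same two toy documents run through scikit-learn’s TF-IDF implementation (note that this implementation normalizes each document vector to unit length by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the food was great and the service was great",
    "the film had a great director",
]

# Words shared across many documents are automatically down-weighted.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # TF-IDF scores instead of raw counts
```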

Word Embeddings

Word embeddings are methods of representing text that ensure that words with similar meanings have similar numerical representations.

Word embeddings operate by “vectorizing” words, meaning that they represent words as real-valued vectors in a vector space. The vectors exist in a grid or matrix, and they have a direction and length (or magnitude). Every word is mapped to one vector, and words that are similar in meaning have similar direction and magnitude. This type of encoding makes it possible for a machine learning algorithm to learn complicated relationships between words.

The embeddings that represent different words are created with regard to how the words in question are used. Because words that are used in similar ways will have similar vectors, the process of creating word embeddings automatically translates some of the meaning the words have. A bag of words approach, by contrast, creates brittle representations where different words will have dissimilar representations even if they are used in highly similar contexts.

As a result, word embeddings are better at capturing the context of words within a sentence.

There are different algorithms and approaches used to create word embeddings. Some of the most common and reliable word embedding methods include: embedding layers, word2vec, and GloVe.

Embedding Layers

One potential way to use word embeddings in a machine learning/deep learning system is through an embedding layer. Embedding layers are deep learning layers that convert words into embeddings, which are then fed into the rest of the deep learning system. The word embeddings are learned as the network trains for a specific text-based task.

In a word embedding approach, similar words will have similar representations and be closer to each other than to dissimilar words.

In order to use embedding layers, the text needs to be preprocessed first. The text in the document has to be one-hot encoded, and the vector size needs to be specified in advance. The one-hot text is then converted to word vectors and the vectors are passed into the machine learning model.
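
A minimal PyTorch sketch of an embedding layer (vocabulary size and dimensions are arbitrary; PyTorch takes integer word indices rather than explicit one-hot vectors, which is mathematically equivalent):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 32
embedding = nn.Embedding(vocab_size, embed_dim)

# A "sentence" of five words, already mapped to integer indices.
word_ids = torch.tensor([[12, 47, 5, 318, 9]])
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 5, 32): one 32-dimensional vector per word

# These vectors feed the rest of the network, and their values are updated
# during training, so the embeddings are learned for the task at hand.
```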

Word2Vec

Word2Vec is another common method of embedding words. Word2Vec uses statistical methods to convert words to embeddings, and it is optimized for use with neural network-based models. Word2Vec was developed by Google researchers, and it is one of the most commonly used embedding methods, as it reliably yields useful, rich embeddings. Word2Vec representations are useful for identifying semantic and syntactic commonalities in language. This means that Word2Vec representations capture relationships between similar concepts, being able to distinguish that the commonality between “King” and “Queen” is royalty and that “King” implies “man-ness” while “Queen” implies “woman-ness”.
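
Training Word2Vec on your own corpus is a few lines with the gensim library (a toy corpus here, so the resulting similarities are noisy; real embeddings need far more text):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

# vector_size sets the embedding dimension; window sets the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["king"].shape)                # (50,)
print(model.wv.similarity("king", "queen"))  # tends higher than king/dog
```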

GloVe

GloVe, or Global Vectors for Word Representation, builds upon the embedding algorithms used by Word2Vec. GloVe embedding methods combine aspects of both Word2Vec and matrix factorization techniques like Latent Semantic Analysis. The advantage of Word2Vec is that it can capture context, but as a tradeoff it poorly captures global text statistics. Conversely, traditional vector representations are good at determining global text statistics, but they aren’t useful for determining the context of words and phrases. GloVe draws from the best of both approaches, creating word context based on global text statistics.
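
Pretrained GloVe vectors are distributed as plain text files, one word and its vector per line, so loading them takes only a few lines (the file name below is a placeholder for whichever pretrained file you download):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=float)
    return embeddings

# e.g. vectors = load_glove("glove.6B.100d.txt")  # hypothetical local path
```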
