
Natural Language Processing

AI Startup Diffbot Reads Entire Public Internet To Pursue Fact-Based Text Generation




The recent advances in natural language processing and text generation accomplished by OpenAI through its GPT-2 and GPT-3 language models have been impressive: these models can generate text that looks as though it were genuinely written by a human. Unfortunately, although they excel at writing natural-sounding text, they are not equipped to write text that is factual. Advanced language models cobble sentences together from the words that make the most sense in context, without paying any attention to the veracity of the claims in the generated text. As reported by MIT Technology Review, a startup known as Diffbot aims to solve this problem by having an AI extract as many facts as it can from the internet.

Diffbot is a startup hoping to make AI more useful for practical text generation tasks like auto-populating spreadsheets and autocompleting sentences or code. For the text generated by an AI to be reliable, the AI itself needs to be trustworthy, and it has to have some concept of factual versus fictional statements. Diffbot’s approach to giving a text generation program the ability to generate factual statements begins with collecting massive amounts of text from practically the entire public web. Diffbot parses text in multiple languages and splits it into sets of fact-based triplets, with the subject, verb, and object of a given fact linking one concept to another. For instance, it might represent facts regarding Bill Gates and Microsoft like this:

Bill Gates is the founder of Microsoft. Microsoft is a computer technology company.
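As a toy illustration (not Diffbot's actual parser, which is far more sophisticated), the two sentences above can be reduced to subject-verb-object triples with a simple pattern match:

```python
import re

# Toy subject-verb-object extractor for simple "X is the Y of Z" and
# "X is a Y" sentences. This only illustrates the triple representation.
def extract_triple(sentence):
    m = re.match(r"(.+?) is (?:the )?(.+?) of (.+?)\.?$", sentence)
    if m:
        subject, relation, obj = m.groups()
        return (subject, relation + " of", obj)
    m = re.match(r"(.+?) is an? (.+?)\.?$", sentence)
    if m:
        subject, obj = m.groups()
        return (subject, "is a", obj)
    return None

sentences = [
    "Bill Gates is the founder of Microsoft.",
    "Microsoft is a computer technology company.",
]
triples = [extract_triple(s) for s in sentences]
print(triples)
# [('Bill Gates', 'founder of', 'Microsoft'),
#  ('Microsoft', 'is a', 'computer technology company')]
```

Each triple links one concept (the subject) to another (the object) through a relation, which is what makes the facts composable into a graph.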

Diffbot takes all of these short factoids and joins them together to create a knowledge graph. Knowledge graphs create webs of relationships between concepts, often paired with a reasoner that assists in drawing new conclusions from those relationships. Put another way, knowledge graphs use data interlinking, and they can help machine learning algorithms model knowledge domains. Knowledge graphs have actually been around for decades, and many early AI researchers considered them important tools for allowing AI to understand the human world. However, knowledge graphs were typically created by hand, a difficult, painstaking process. Automating the creation of knowledge graphs could allow AIs to attain a much greater contextual understanding of concepts and produce text that is fact-based.
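A minimal sketch of how such triples can link into a knowledge graph with a toy reasoner (a hypothetical structure for illustration, not Diffbot's actual graph):

```python
# Toy knowledge graph: each fact is a (subject, relation, object) triple.
facts = [
    ("Bill Gates", "founder of", "Microsoft"),
    ("Microsoft", "is a", "computer technology company"),
]

# Index triples by subject so the graph can be traversed quickly.
graph = {}
for subj, rel, obj in facts:
    graph.setdefault(subj, []).append((rel, obj))

def derive(entity):
    """A minimal 'reasoner': follow outgoing edges to list every fact
    reachable from an entity, including indirect ones."""
    conclusions, frontier = [], [entity]
    while frontier:
        node = frontier.pop()
        for rel, obj in graph.get(node, []):
            conclusions.append((node, rel, obj))
            frontier.append(obj)
    return conclusions

print(derive("Bill Gates"))
# [('Bill Gates', 'founder of', 'Microsoft'),
#  ('Microsoft', 'is a', 'computer technology company')]
```

Starting from "Bill Gates", the traversal surfaces the indirect conclusion that he founded a computer technology company, which is the kind of chained inference a reasoner over a knowledge graph enables.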

Google started using knowledge graphs a few years ago to provide summaries of information when a popular topic is searched for. The knowledge graph is used to pull the most relevant factoids and represent them as a summary. Diffbot wants to do the same thing for every topic, not just the most popular ones. This requires building an absolutely massive knowledge graph, compiled by crawling the entire public web, something otherwise done only by Google and Microsoft. Diffbot rescans the whole web and updates the knowledge graph with new information every four or five days, and over the course of a month it adds somewhere between 100 million and 150 million entries.

Diffbot doesn’t read the text of a website the way normal web crawlers do. Instead, it uses computer vision algorithms on the raw pixels of a web page to pull video, image, article, and discussion data from the page. It identifies the key elements of the webpage and then extracts facts in a variety of languages, in adherence to the three-part factoid schema.

Currently, Diffbot offers both paid and free access to its knowledge graph. While researchers may access the graph for free, companies like DuckDuckGo and Snapchat use it to summarize text and extract snippets of trending news items. Meanwhile, Nike and Adidas utilize the platform to find sites selling counterfeit products, which is possible because Diffbot is able to ascertain which sites are actually selling shoes, not just having discussions about them.

In the future, Diffbot plans to expand its capabilities and add a natural-language interface to the platform, capable of answering almost any question you ask it and backing up those answers with sources. Ideally, Diffbot’s capabilities would be combined with a powerful language synthesis model like GPT-3.


Blogger and programmer with specialties in Machine Learning and Deep Learning topics. Daniel hopes to help others use the power of AI for social good.


Robert Weissgraeber, CTO & Managing Director at AX Semantics – Interview Series




Robert Weissgraeber is the Managing Director and CTO of AX Semantics, where he heads up product development and engineering. Robert is an in-demand speaker, an author on topics including agile software development and Natural Language Generation (NLG) technologies, and a member of the Forbes Technology Council. He was previously Chief Product Officer at aexea, studied Chemistry at the Johannes Gutenberg University, and did a research stint at Cornell University.

What initially attracted you to the space of Natural Language Generation (NLG)?

Writing, and the way it has traditionally been executed, has not seen significant innovation since the advent of the typewriter 200 years ago and the word processor in the late 1960s. Three things attracted me to the NLG sector. First, witnessing the challenges and hardships people face and continuously endure in having to create vast quantities of content and text. For example, there are literally people who work in e-commerce who have to write hundreds of similar, yet unique, t-shirt and clothing descriptions every month as new products come in. The number of people needed to do this is astronomical, and the process is time-consuming, costly and impossible to scale. I knew the ability to utilize AI to automate content generation (vs. trying to produce content manually) would be a game changer for many industries that must regularly create mass volumes of content — not only in English but also in many other languages.

Second, seeing the type of ‘low tech’ solutions others brought to the market — like spinning tools or poorly implemented NLG tools with ‘Enterprise UX’ — only solidified my attraction to the power of NLG.

Lastly, I wanted to work on something that wasn’t a rendition of the next online shop or the next “Uber for X”, but something capable of solving a really hard tech problem while also creating a solution for a real-world challenge. A perfect NLG solution with the ability to redefine content generation for the digital age can reduce ‘noise’ for all humanity, since it allows for super-precise communication.

Could you discuss some of the NLG solutions that are offered by AX Semantics?

AX Semantics is a 100% SaaS-based NLG solution with an easy to use UI (user interface). Customers build their own content generation machine by configuring their business application with our NLG tool, which automates content in 110 languages in a matter of minutes — including cross-data generation such as Chinese text from English data. As a result, companies can take data and information and create unique content rapidly and at scale regardless of perpetual business and cultural shifts.

There are a myriad of use cases for NLG technology. Different industries use it to solve content challenges unique to their sectors:

  • E-commerce: Most customers use our NLG software to generate large volumes of unique product descriptions (critical for SEO), category content or personalized emails like basket dropout recovery emails.
  • Brand/Customer Communications, including Social Media: Brands and content agencies use NLG to keep a steady flow of fresh blog content, or to create and populate unique social content across multiple social media channels — and can do so in 110 languages.
  • Media or ‘Robot Journalism’: Publishers use our NLG software for election reporting or data-based journalism such as pollution-level monitoring, stock table earnings, sports scores and crime blotters — freeing up journalists to work on more creative, engaging journalism or hard-hitting investigative stories. In many ways, content generation software is helping to revive local journalism, particularly for cash-strapped small newspapers. Journalism has been in a tight spot since 2000 as newspapers have cut reporters and editors or shut down entirely. NLG is actually an unlikely ally in the push to save journalism.
  • Financial Services/Banking: Financial analysts, brokers, and executives face the demand to quickly update the content required by state and federal laws and regulations, such as details about investment plans, risk assessments, and financial filings — all of which must be updated regularly. Our NLG solution addresses the pain point of recurring financial reports, regulatory filings, executive summaries, and other written communication – all of which typically require massive amounts of financial data from disparate sources to be gathered, analyzed and translated into text customized for a broad range of audiences and languages. Banking and finance employees can effortlessly turn mountains of data into real-time actionable written narratives, create reports, descriptions of terms and loans, draft regulatory filings, and documents detailing investments — in more than 110 languages — all with minimal training — freeing up bandwidth for higher-value activities and responsibilities.
  • Pharmaceutical: Pharma companies use our HIPAA-compliant NLG software to generate regulatory Clinical Study Reports (CSRs) on medications up to 40% faster, by automating 30% of writing the CSR. This is crucial because the most challenging phase of bringing a drug to market is the human drug trial, or Phase III, during which time, clinicians must write a CSR that describes the pharmacological impacts and trial outcomes. Typically, data collected from the human drug trials is gathered and medical writing teams manually compile the report, however, this outdated, onerous and time-consuming process can potentially delay life-saving medications from coming to market sooner and cost pharmaceutical companies millions. A capacity challenge also exists. Writing a CSR report typically takes several months to complete, which limits the number of CSRs a team of medical writers can produce annually.

A writer’s voice is considered important in journalism and other types of writing, can you discuss the importance of drafting “personalities” for content generated by NLG?

With ‘data-to-text’ solutions like our NLG pipeline approach (in contrast to text-to-text, corpus-based approaches like GPT-2/3), the writer is an essential and critical part of the creative process. The writer configures many levels of meaning between the data and adds ‘micro-templates’ for all aspects of the text, which allows the machine to select and combine all aspects in a ‘bag of words’ approach.
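A rough, hypothetical sketch of the data-to-text idea (the product data and template syntax here are invented, not AX Semantics' actual engine): the writer supplies several micro-template phrasings, and the machine picks one and fills it from structured data.

```python
import random

# Hypothetical micro-templates: the writer authors several phrasings for
# the same fact; the engine selects one and fills it from structured data.
templates = [
    "The {name} comes in {color} and costs {price}.",
    "Available in {color}, the {name} is priced at {price}.",
]
product = {"name": "Trail Runner T-shirt", "color": "navy blue", "price": "$19.99"}

random.seed(0)  # deterministic pick, just for the example
sentence = random.choice(templates).format(**product)
print(sentence)
```

With hundreds of products and a handful of phrasings per slot, this is how unique-sounding descriptions can be produced at scale from the same data.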

Can you discuss what hybrid content generation is and how employees can take advantage of this?

Hybrid content generation is where human and machine work together. Each actor focuses on the aspect they do best. Humans prioritize the creative part, writing style and specific content selection, including curation and definitions. The machine takes care of production, grammatical correctness building and scaling the content.

Hybrid content born from a partnership between man and machine fills a pressing need for fresh, vital content around the clock. Our software generates content that is almost indistinguishable from a human writer. Employees can use content generation to create new content that can be changed and updated at a moment’s notice. Working with content generation software allows them to not only fulfill, but exceed their job requirements and expectations.

With content being able to be created on the fly how will content quality be quantified?

Ultimately, it will be measured and quantified based on results. Early on, we had discussions about measuring quality aspects that were subjective to personal meaning, i.e. “I like this phrase or that,” etc., but this can now be measured objectively since scaled A/B testing is now possible. One of our customers, for example, struggled with the decision to use formal or informal language (some non-English languages have that codified), and was able to test this out to see what worked best.

Can you share your opinions on how content will become decentralized?

No one source or market has a monopoly on content generation or the ability to scale content anymore. That’s the power of NLG technology and software — it delivers equal opportunity and equal access to companies large and small.

With access to valid data sources from all over the world along with NLG and publishing technology, businesses of all sizes can build scaled content, and adapt it to their own needs while using NLG technology to keep it continuously updated. For example, a customer service department can produce their own product descriptions with a focus on service-specific content, or an online marketer can tailor their content to be sales oriented — all without added maintenance or cost.

What are going to be some of the new “unlocked” business opportunities from this type of content generation?

First of all, we’re going to see totally new types of hyper-personalized content where lots of data sources are combined to produce content for a specific individual, such as a weather report that accounts for someone’s travel itinerary or financial services with individualized fund reporting. Second, as companies increasingly embrace the digital age, they’ll be able to utilize automated content generation to create a more robust online presence for their business.

Could you discuss some of the potential aspects for social good from NLG?

Consumers are inundated with a continuous stream of communication — texts, emails, countless ads and promotions — across multiple channels, including their mobile devices, computers and even the mail, which they need to scroll through and read to find the information they want and need.

NLG allows for individualized, precise and noise-reduced communication. Imagine receiving only newsletters or reports that take your personal information into account and adapt it to your needs. NLG provides a better, more thoughtful way to reach customers in a way that matters to them.

What are some enterprises that are currently using AX Semantics?

Approximately three years ago after validating our solution with select clients, we began to introduce our NLG solution to the mass market. We had hundreds of customers try our software and build their individualized solution with AX Semantics. We then fine-tuned the necessary learning materials and onboarding process. A lot of those initially small customers now have scaled their content needs with us, including companies like Deloitte, Adidas, Nestlé, Otto and Beiersdorf.

Is there anything else that you would like to share about AX Semantics?

We’re very proud of the fact AX Semantics received recognition as one of the world’s top five providers of natural language generation platforms by Gartner, and that we were named a top emerging company in the NLG market by Forrester.

Lastly, in addition to our own client base and sales, we’re looking for companies that want to build their own vertical use cases on top of our technology, and we are actively supporting those companies with training, etc. So if you are building a new product or company and want to use content generation we’d love to speak with you.

Thank you for the great interview and the detailed answers regarding NLG. Readers who wish to learn more should visit AX Semantics.


How Language Processing is Being Enhanced Through Google’s Open Source BERT Model




BERT Search Enhancements

Bidirectional Encoder Representations from Transformers, otherwise known as BERT, is a training model that has drastically improved the efficiency and effectiveness of NLP models. Now that Google has made BERT models open source, NLP models across all industries can be improved. In this article, we take a look at how BERT is making NLP one of the most powerful and useful AI solutions in today’s world.

Applying BERT models to Search

Google’s search engine is world-renowned for its ability to present relevant content, and Google has now made this natural language processing program open source to the world.

The ability of a system to read and interpret natural language is becoming more and more vital as the world exponentially produces new data. Google’s library of word meanings, phrases, and general ability to present relevant content is open source. Beyond natural language processing, the BERT model can extract information from large amounts of unstructured data and can be applied to create search interfaces for any library. In this article, we will see how this technology can be applied in the energy sector.

BERT (Bidirectional Encoder Representations from Transformers) is a pre-training approach proposed by the Google AI Language group, developed to overcome a common issue of early NLP models: the lack of sufficient training data.

Let us elaborate, without going into too much detail:

Training Models

Low-level (e.g. named entity recognition, topic segmentation) and high-level (e.g. sentiment analysis, speech recognition) NLP tasks require task-specific annotated datasets. While they are hard to come by and expensive to assemble, labeled datasets play a crucial role in the performance of both shallow and deep neural network models. High-quality inference results could only be achieved when millions or even billions of annotated training examples were available. And that was a problem that made many NLP tasks unapproachable. That is, until BERT was developed.

BERT is a general-purpose language representation model, trained on large corpora of unannotated text. When the model is exposed to large amounts of text content, it learns to understand context and relationships between words in a sentence. Unlike previous models that represented meaning only at the word level (“bank” would mean the same in “bank account” and “grassy bank”), BERT actually accounts for context: what comes before and after the word in a sentence. Context turned out to be a major missing capability of NLP models, with a direct impact on model performance. Designing a context-aware model such as BERT is regarded by many as the beginning of a new era in NLP.
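The difference can be shown with a deliberately simplified toy (real BERT learns representations with transformer attention, nothing like the hashing used here): a static lookup gives "bank" one vector everywhere, while a context-aware representation mixes in neighboring words and so differs between sentences.

```python
import hashlib

# Toy illustration of static vs. contextual word vectors. We hash words
# to 3-dimensional vectors and average neighbors to fake "context".
def static_vec(word):
    h = hashlib.md5(word.encode()).digest()
    return [float(h[0]), float(h[1]), float(h[2])]

def contextual_vec(words, i, window=1):
    # Average the word's own vector with its neighbors' vectors.
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    neighborhood = [static_vec(w) for w in words[lo:hi]]
    return [sum(dim) / len(neighborhood) for dim in zip(*neighborhood)]

s1 = "open a bank account".split()
s2 = "sat on the grassy bank".split()

# A static lookup gives "bank" the identical vector in both sentences...
assert static_vec("bank") == static_vec("bank")

# ...while the context-aware version yields different vectors,
# because the surrounding words differ.
v1 = contextual_vec(s1, s1.index("bank"))
v2 = contextual_vec(s2, s2.index("bank"))
print(v1 != v2)
```

The point is only that the representation of a word becomes a function of its sentence, which is the capability earlier word-level models lacked.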

Training BERT on large amounts of text content is a technique known as pre-training. This means that the model’s weights are adjusted for general text understanding tasks, and that more fine-grained models can be built on top of it. The authors demonstrated the superiority of this technique by applying BERT-based models to 11 NLP tasks and achieving state-of-the-art results.

Pre-Trained Models

The best thing is: pre-trained BERT models are open source and publicly available. This means that anyone can tackle NLP tasks and build their models on top of BERT. Nothing can beat that, right? Oh, wait: this also means that NLP models can now be trained (fine-tuned) on smaller datasets, without the need to train from scratch. The beginning of a new era, indeed.

These pre-trained models help companies cut down the cost and time to deploy NLP models for internal or external use. The effectiveness of well-trained NLP models is emphasized by Michael Alexis, CEO of a virtual team-culture building company:

“The biggest benefit of NLP is the scalable and consistent inference and processing of information.” – Michael Alexis

Michael describes how NLP can be applied to culture-fostering programs such as icebreakers or surveys. A company can gain valuable insight into how its culture is doing by analyzing the responses of employees, not only through the text itself but also through its connotation. Essentially, the model “reads between the lines” to draw inferences on emotion, feel, and overall outlook. BERT can aid in situations such as this one by giving pre-trained models a basis of indicators to build on, uncovering the nuances of language and providing more accurate insights.

Improving queries

The capability to model context has turned BERT into an NLP hero and has revolutionized Google Search itself. Below is a quote from the Google Search product team about their testing experience while tuning BERT to understand the intent behind a query.

“Here are some of the examples that demonstrate BERT’s ability to understand the intent behind your search. Here’s a search for “2019 brazil traveler to USA needs a visa.” The word “to” and its relationship to the other words in the query are particularly important to understanding the meaning. It’s about a Brazilian traveling to the U.S. and not the other way around. Previously, our algorithms wouldn’t understand the importance of this connection, and we returned results about U.S. citizens traveling to Brazil. With BERT, Search is able to grasp this nuance and know that the very common word “to” actually matters a lot here, and we can provide a much more relevant result for this query.”
Understanding searches better than ever before, by Pandu Nayak, Google Fellow and Vice President of Search.

BERT search example, before and after (source: Google blog).

In our last piece on NLP and OCR, we illustrated some NLP uses in the real-estate sector. We also mentioned that “NLP tools are ideal information extraction tools.” Let us now look at the energy sector and see how disruptive NLP technologies like BERT enable new use cases.

NLP models can extract information from large amounts of unstructured data

One way in which NLP models can be used is for the extraction of critical information from unstructured text data. Emails, journals, notes, logs, and reports are all examples of text data sources that are part of businesses’ daily operations. Some of these documents may prove crucial in organizational efforts to increase operational efficiency and reduce costs. 

When aiming to implement wind turbine predictive maintenance, failure reports may contain critical information about the behavior of different components. But since different wind turbine manufacturers have different data collection norms (i.e. maintenance reports come in different formats and even languages), manually identifying relevant data items could quickly become expensive for the plant owner. NLP tools can extract relevant concepts, attributes, and events from unstructured content. Text analytics can then be employed to find correlations and patterns in different data sources. This gives plant owners the chance to implement predictive maintenance based on quantitative measures identified in their failure reports.
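As a minimal sketch (with invented report formats, not any manufacturer's real schema), even simple pattern rules can pull component and failure events out of free-text maintenance notes before text analytics takes over:

```python
import re

# Hypothetical maintenance-log lines from two vendors with different
# reporting styles; real reports are far messier than this.
reports = [
    "2021-03-02 FAULT gearbox: oil temperature exceeded threshold",
    "Turbine 7 - blade pitch actuator failure detected during inspection",
]

# One pattern per known report style, each exposing the same fields.
patterns = [
    re.compile(r"FAULT (?P<component>\w+): (?P<event>.+)"),
    re.compile(r"- (?P<component>[\w ]+?) failure (?P<event>.+)"),
]

def extract(line):
    for pat in patterns:
        m = pat.search(line)
        if m:
            return {"component": m.group("component").strip(),
                    "event": m.group("event").strip()}
    return None

records = [extract(r) for r in reports]
print(records)
```

Normalizing heterogeneous reports into one schema like this is what makes the downstream correlation and pattern-finding possible; a production pipeline would replace the regexes with a trained extraction model.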

NLP models can provide natural language search interfaces

Similarly, geoscientists working for oil and gas companies usually need to review many documents related to past drilling operations, well logs, and seismic data. Since such documents also come in different formats and are usually spread across a number of locations (both physical and digital), they waste a lot of time looking for the information in the wrong places. A viable solution in such a case would be an NLP-powered search interface, which would allow users to look up data in natural language. Then, an NLP model could correlate data across hundreds of documents and return a set of answers to the query. The workers can then validate the output based on their own expert knowledge and the feedback would further improve the model. 
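A toy version of such a search interface, over a made-up document store (a real system would use a semantic model like BERT rather than the raw term overlap used here):

```python
# Toy natural-language search: score each document by how many query
# terms it shares, then return matches ranked best-first.
documents = {
    "well_log_12.txt": "drilling operations halted due to high pressure in well 12",
    "seismic_q3.txt": "seismic survey results for the northern field, third quarter",
    "ops_review.txt": "annual review of drilling operations and equipment costs",
}

def search(query, docs):
    terms = set(query.lower().split())
    scores = {
        name: len(terms & set(text.lower().split()))
        for name, text in docs.items()
    }
    # Rank by overlap, dropping documents with no matching terms.
    return [n for n, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]

print(search("past drilling operations", documents))
# ['well_log_12.txt', 'ops_review.txt']
```

Term overlap fails exactly where the article says traditional models fail, on synonyms and industry jargon, which is why contextual models are the natural upgrade for this interface.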

However, there are also technical considerations for deploying such models. One aspect would be that industry-specific jargon can confuse traditional learning models that do not have the appropriate semantic understanding. Secondly, the models’ performance may be affected by the size of the training dataset. This is when pre-trained models such as BERT can prove beneficial. Contextual representations can model the appropriate word meaning and remove any confusion caused by industry-specific terms. By using pre-trained models, it is possible to train the network on smaller datasets. This saves time, energy, and resources that would have otherwise been necessary for training from scratch.

What about your own business? 

Can you think of any NLP tasks that might help you cut down on costs and increase operational efficiency?

The Blue Orange Digital data science team is happy to tweak BERT for your benefit too!


Quantum Stat’s Newest Creation is the NLP Model Forge



Image: NLP Model Forge

Here at Unite.AI, we have already covered the release of Quantum Stat’s “Big Bad NLP Database,” as well as its NLP Colab Repository. The tech company’s newest creation is its NLP Model Forge, which is a database and code template generator for 1,400 NLP models.

According to the company, “It’s the most diverse line-up around right now for developers!”

What is the NLP Model Forge?

Quantum Stat has set out to achieve fast prototyping by “streamlining an inference pipeline on the latest fine-tuned NLP model.” 

One of the issues with prototyping is that it can be time-consuming, due to the large number of model architectures and NLP libraries available on the market. The NLP Model Forge was developed to address this.

The NLP Model Forge’s 1,400 fine-tuned models were curated from some of the top NLP research organizations, such as Hugging Face, Facebook (ParlAI), DeepPavlov and AI2. It consists of fine-tuned code for pre-trained models, spanning several tasks such as classic text classification, text-to-speech and commonsense reasoning.

The developer is able to select several models at a time, and the process is simple and clear. By clicking a button on the Forge, the developer is presented with generated code templates that are ready to run and can be pasted into a Colab notebook.

A developer can easily create an inference API, since the code blocks are formatted as batch and Python scripts.
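The template-generation idea itself can be sketched in a few lines (a hypothetical simplification, not the Forge's actual model-specific templates): given a task, a model name and an example input, emit a ready-to-paste inference snippet as a string.

```python
# Hypothetical sketch of a code-template generator: it only builds the
# snippet text; nothing here executes any model.
TEMPLATE = '''\
# pip install transformers
from transformers import pipeline

nlp = pipeline("{task}", model="{model}")
print(nlp("{example_input}"))
'''

def generate_template(task, model, example_input):
    return TEMPLATE.format(task=task, model=model, example_input=example_input)

snippet = generate_template(
    task="summarization",
    model="t5-small",
    example_input="The NLP Model Forge curates 1,400 fine-tuned models.",
)
print(snippet)
```

The value of the real Forge is that it maintains 1,400 such templates, each already wired to the right library and model checkpoint, so the developer skips this boilerplate entirely.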

The current tasks that are available in the Forge include: Sequence Classification, Text Generation, Question Answering, Token Classification, Summarization, Natural Language Inference, Conversational AI, Machine Translation, Text-to-Speech and Commonsense Reasoning. 

Metadata Descriptions

According to Quantum Stat, the best features of the Forge are its diversity of architectures, languages and libraries, as well as the meta descriptions of each model.

The metadata descriptions help guide a developer through their chosen model and the different tasks.

Quantum Stat’s post about the release of the Forge details how to generate code blocks to run inference on the models, which is a simple and straightforward process. The generated code blocks are programmatically labeled with relevant metadata, which makes the functionality of each model easier to interpret.

After this, the developer has the choice to edit the code right on the webpage, email the code or copy each code block to the clipboard so that it can be pasted in a local machine.

The other choice is to click the “Colab” button, which copies all the code blocks on the page and opens them in a new Colab notebook.

Quantum Stat’s NLP Model Forge is just one of the company’s impressive new releases. The database and code template generator is an important tool for developers, and its format makes it easy to access. The database will play a big role in reducing the time-consuming task of prototyping.

