AI Startup Diffbot Reads Entire Public Internet To Pursue Fact-Based Text Generation




The recent advances in natural language processing and text generation that OpenAI has achieved with its GPT-2 and GPT-3 language models are impressive: the models can generate text that looks as though it was genuinely written by a human. Unfortunately, although these models excel at writing natural-sounding text, they are not equipped to write text that is factual. Advanced language models assemble sentences from the words that are most likely in context, without paying any attention to the veracity of the claims within the generated text. As reported by MIT Technology Review, a startup known as Diffbot aims to solve this problem by having an AI extract as many facts as it can from the internet.

Diffbot hopes to make AI more useful for practical text generation tasks like auto-populating spreadsheets and autocompleting sentences or code. For the text generated by the AI to be reliable, the AI itself needs to be trustworthy, and it has to have some concept of factual versus fictional statements. Diffbot’s approach to giving a text generation program the ability to generate factual statements begins with collecting massive amounts of text from practically the entire public web. Diffbot parses text in multiple languages and splits it into three-part factoids, each made up of a subject, a verb, and an object that together link one concept to another. For instance, it might represent facts regarding Bill Gates and Microsoft like this:

Bill Gates is the founder of Microsoft. Microsoft is a computer technology company.
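
As a rough illustration, the two factoids above could be held in code as subject-verb-object triples. This is only a minimal sketch: the `Triple` type, its field names, and the `to_sentence` helper are hypothetical names chosen for this example, not Diffbot's actual schema or API.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-part factoid (hypothetical structure, not Diffbot's schema)."""
    subject: str
    verb: str
    obj: str


# The two example facts from the text, written as triples.
facts = [
    Triple("Bill Gates", "is the founder of", "Microsoft"),
    Triple("Microsoft", "is a", "computer technology company"),
]


def to_sentence(triple: Triple) -> str:
    """Render a triple back into a plain-English sentence."""
    return f"{triple.subject} {triple.verb} {triple.obj}."


for fact in facts:
    print(to_sentence(fact))
```

Note that "Microsoft" is the object of the first triple and the subject of the second, which is what lets separate facts be chained together.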

Diffbot takes all of these short factoids and joins them together to create a knowledge graph. Knowledge graphs create webs of relationships between concepts, often along with a reasoner that helps derive new conclusions from these relationships. To put that another way, knowledge graphs interlink data, and they can help machine learning algorithms model knowledge domains. Knowledge graphs have been around for decades, and many early AI researchers considered them important tools for allowing AI to understand the human world. However, knowledge graphs were typically created by hand, which is a difficult, painstaking process. Automating the creation of knowledge graphs could allow AIs to attain a much greater contextual understanding of concepts and produce text that is fact-based.
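
To make the idea of a knowledge graph with a reasoner more concrete, here is a toy sketch. It assumes facts arrive as (subject, predicate, object) tuples; the predicate names `founder_of`, `is_a`, and `founded_a`, along with the single inference rule, are invented for this illustration and do not describe how Diffbot's graph or any real reasoner works.

```python
from collections import defaultdict

# Toy facts as (subject, predicate, object) tuples; predicate names are made up.
triples = [
    ("Bill Gates", "founder_of", "Microsoft"),
    ("Microsoft", "is_a", "computer technology company"),
]


def build_graph(facts):
    """Index facts by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in facts:
        graph[subject].append((predicate, obj))
    return graph


def infer_founded_a(graph):
    """Toy reasoner rule: if X founder_of Y and Y is_a Z, conclude X founded_a Z."""
    inferred = []
    for person, edges in list(graph.items()):
        for predicate, organization in edges:
            if predicate != "founder_of":
                continue
            # Follow the edge to the organization and look up its category.
            for next_predicate, category in graph.get(organization, []):
                if next_predicate == "is_a":
                    inferred.append((person, "founded_a", category))
    return inferred


graph = build_graph(triples)
print(infer_founded_a(graph))
```

The derived fact, that Bill Gates founded a computer technology company, is not stored anywhere in the input. It follows from linking the two triples through their shared node, "Microsoft", which is the kind of new conclusion a reasoner is meant to produce.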

Google started using knowledge graphs a few years ago to help provide summaries of information when a popular topic is searched for. The knowledge graph is used to pull the most relevant factoids and present them as a summary. Diffbot wants to do the same thing for every topic, not just the most popular ones. This requires building an absolutely massive knowledge graph, compiled by crawling the entire public web, something that otherwise only Google and Microsoft do. Diffbot scans the whole web and updates the knowledge graph with new information every four or five days, and over the course of a month it adds between 100 million and 150 million entries.

Diffbot doesn’t read the text of a website the way normal web crawlers do. Instead, it uses computer vision algorithms to read the raw pixels of a web page and pull video, image, article, and discussion data from the page. It identifies the key elements of the webpage and then extracts facts in a variety of languages, in adherence to the three-part factoid schema.

Currently, Diffbot offers both paid and free access to its knowledge graph. While researchers may access the graph for free, companies like DuckDuckGo and Snapchat use it to summarize text and extract snippets of trending news items. Meanwhile, Nike and Adidas utilize the platform to find sites selling counterfeit products, which is possible because Diffbot is able to ascertain which sites are actually selling shoes, not just having discussions about them.

In the future, Diffbot plans to expand its capabilities and add a natural-language interface to the platform, capable of answering almost any question you ask it and backing up those answers with sources. Ideally, the capabilities of Diffbot would be combined with a powerful language synthesis model like GPT-3.