Quantum Stat has released its “Big Bad NLP Database,” a big step forward for natural language processing (NLP). The database contains hundreds of datasets for machine learning developers to use.
According to the company, it provides solutions for NLP and AI initiatives. Its services range from preprocessing to web app development and include a multi-faceted approach that combines machine learning and deep neural networks, chatbot and dialogue management, and the new NLP database.
The company also conducts primary and secondary research to help individuals analyze developments within the industry.
Central Hub of NLP Data
The decision to create the database, which is the world’s largest data library in natural language processing, came out of the need for a central hub to hold NLP data. The company aimed to make that data more accessible and searchable than the alternative, which often requires researchers to search through multiple third-party libraries.
The company has been developing the database for a number of weeks and currently has around 200 datasets. The collection covers a wide variety of datasets, not just classics such as CommonCrawl and Penn Treebank, which are also included.
Along with the range of datasets comes a range of NLP tasks. Some datasets focus on classification and question answering, but there are also datasets for text-to-SQL, speech recognition, and multi-modal tasks.
Quantum Stat wants the database to be community-driven with contributions from users. The company has opened its doors for anyone to send a new dataset or recommend changes.
Another focus is to add datasets in languages other than English, moving away from a strictly English collection. The goal is to make the library more global and accessible to others.
Upon entering the “Big Bad NLP Database,” a user is presented with a clean and organized layout. The name of each dataset is listed, followed by its language and a detailed description. Each entry also lists the number of instances, format, task, year created, and creator, and has a download link to follow.
One will encounter datasets such as the Historical Newspapers Daily World Time Series dataset, containing daily contents of newspapers in the US and UK from 1836 to 1922; the SciQ Dataset, containing 13,679 crowdsourced science exam questions in the fields of Physics, Biology, and Chemistry; CommonCrawl, containing the data from 25 billion web pages; and MovieLens, a dataset containing 22,000,000 ratings and 580,000 tags for 33,000 movies by 240,000 users.
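The listing fields described above (name, language, description, instances, format, task, year created, creator, download link) can be pictured as a simple record type. The following is only a minimal sketch, not Quantum Stat's actual schema or API: the class name `DatasetEntry`, the `search` helper, the task and language labels, and the `example.com` URLs are all hypothetical, and fields the article does not state are left as placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row; field names are illustrative, not the site's schema."""
    name: str
    language: str
    description: str
    instances: Optional[int]   # None where the article gives no count
    data_format: str
    task: str
    year: Optional[int]
    creator: str
    download_url: str


def search(entries: List[DatasetEntry],
           task: Optional[str] = None,
           language: Optional[str] = None) -> List[DatasetEntry]:
    """Return entries matching the given task and language (case-insensitive)."""
    results = []
    for entry in entries:
        if task is not None and entry.task.lower() != task.lower():
            continue
        if language is not None and entry.language.lower() != language.lower():
            continue
        results.append(entry)
    return results


# Sample rows built from figures quoted in the article; task, language,
# format, creator, and URL values are assumed placeholders.
CATALOG = [
    DatasetEntry("SciQ Dataset", "English",
                 "13,679 crowdsourced science exam questions",
                 13679, "unknown", "question answering", None,
                 "unknown", "https://example.com/sciq"),
    DatasetEntry("CommonCrawl", "Multilingual",
                 "Data from 25 billion web pages",
                 None, "unknown", "language modeling", None,
                 "unknown", "https://example.com/commoncrawl"),
    DatasetEntry("MovieLens", "English",
                 "22,000,000 ratings and 580,000 tags for 33,000 movies",
                 22000000, "unknown", "recommendation", None,
                 "unknown", "https://example.com/movielens"),
]

if __name__ == "__main__":
    for entry in search(CATALOG, task="question answering"):
        print(entry.name, entry.instances)
```

Filtering on task and language mirrors the kind of lookup a single searchable hub makes possible, compared with checking several third-party libraries one by one.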
Quantum Stat’s impressive database comes at a time when researchers require larger and more diverse datasets due to advances in deep learning. Because of the massive amount of data contained within human language, each unique dataset makes it a little easier to process. The advancement of NLP relies on these datasets, and Quantum Stat has helped accelerate that advancement by gathering so many of them in one place.
NLP will be important in many aspects of society. It can help predict diseases based on electronic health records and a patient’s speech, help companies find out what customers are saying about a product, and identify fake news in a world where it runs rampant.
The technology is advancing extremely rapidly, and it will not be long before it is capable of tackling these complex applications.