Peter Staar, IBM Scientist, COVID-19 Open Research Dataset – Interview Series




IBM scientist Peter Staar has developed an AI tool which is being used by more than 300 experts who are developing a treatment or vaccination for COVID-19.

To help researchers access structured and unstructured data quickly, IBM is offering a cloud-based AI research resource that has been trained on a corpus of more than 45,000 scientific papers contained in the COVID-19 Open Research Dataset (CORD-19), prepared by the White House and a coalition of research groups, and on licensed databases from DrugBank and GenBank.

Dr. Peter Staar joined the IBM Research – Zurich Laboratory in July of 2015 as a post-doctoral research fellow in the Foundations of Cognitive Solutions project. The Belgium-born scientist first came to IBM Research as a summer student in 2006.

You first joined the IBM Research – Zurich Laboratory in July of 2015. What types of projects have you worked on at IBM?

My initial research focused on applications for high-performance computing, and I was part of the winning team for the prestigious ACM Gordon Bell award.

More recently, around 2017, I started to focus on AI, and in August 2018 my team published a paper at the ACM Conference on Knowledge Discovery and Data Mining (KDD 2018) on a massively scalable document ingestion system, which we called the Corpus Conversion Service. This AI-based cloud tool was able to ingest 100,000 PDF pages per day (even of scanned documents) with accuracy above 97 percent, and then train and apply advanced machine learning models that extract the content from these documents at a scale never achieved before. We are now applying this same technology to help researchers with COVID-19.

When did IBM first come across the idea of using Corpus Conversion Service to tackle the COVID-19 epidemic?

In mid-March the White House led an effort to publish more than 45,000 documents on the coronavirus and COVID-19. When we saw the corpus we quickly realised that our technology could help, not just to make the PDFs searchable, but also to combine the knowledge within those PDFs with additional datasets like DrugBank and GenBank. We went live with the service on 3 April.

How would you best describe what the Corpus Conversion Service is?

As with any large volume of disparate data sources, it is difficult to efficiently aggregate and analyse that data in ways that can yield scientific insights. We make this easier using a knowledge graph which finds connections between these data sources to potentially yield new knowledge.
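As a rough illustration of the idea (not IBM's implementation), a knowledge graph can be modelled as nodes for papers, drugs, proteins, genes and sequences, with edges wherever one record refers to another. The minimal Python sketch below uses only the standard library. All record names, the three toy "sources" and the helper functions `build_graph` and `shortest_path` are hypothetical and chosen purely for demonstration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest node path, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbours so the result is deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical toy records standing in for three separate data sources.
paper_mentions = [          # entities mentioned in converted papers
    ("paper:P1", "protein:X"),
    ("paper:P2", "protein:X"),
    ("paper:P2", "gene:G1"),
]
drug_targets = [("drug:D1", "protein:X")]         # drug database entries
sequence_records = [("sequence:S1", "gene:G1")]   # sequence database entries

graph = build_graph(paper_mentions + drug_targets + sequence_records)
path = shortest_path(graph, "drug:D1", "sequence:S1")
print(" -> ".join(path))
```

In this toy graph, the drug and the sequence never appear in the same source, yet the search links them through a paper that mentions both the drug's target protein and the gene. That is the kind of cross-source connection the answer above describes.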

Can you discuss the principal challenge of extracting data from PDF format into a searchable form?

According to Adobe, there are roughly 2.5 trillion Portable Document Format (PDF) files currently in circulation. Think of the knowledge these files contain: scientific articles, technical literature, and much more. But all that content is “dark” or unused, because until now, we have had no way to ingest large numbers of PDF files at scale and make their content usable (or structured).

PDF files often include combinations of vector graphics, text, and bitmap graphics, all of which make extraction of qualitative and quantitative data quite challenging. In fact, automatic content reconstruction has been a problem for over a decade. While many document conversion solutions are available, none of them address scalability or apply AI, which means that they need to rely on expensive human-based maintenance and upgrading.

To the best of our knowledge, the Corpus Conversion Service is the first comprehensive system to use advanced AI at this level of scalability. While existing solutions can only convert one document at a time to a desired output format, our tool can ingest entire collections, a corpus of documents, and build machine learned models on top of that.

How do you extract not only the text that is contained in a document but the structure?

A key element is that we designed the human-computer interaction in the system to allow very fast and massive annotation without any computer science knowledge. This swap to machine learning gives our service a great deal of flexibility, as it can adapt rapidly to certain templates of documents, achieve highly accurate results, and ultimately eliminate the costly and time-consuming tuning typical of traditional rule-based algorithms.

Can you discuss the challenges of building a machine learning model that can scale and respond quickly to hundreds and even potentially thousands of concurrent users?

We have developed the Corpus Conversion Service on top of state-of-the-art cloud services, such as OpenShift on IBM Cloud. This allows us to scale our application effortlessly with increased demand. The AI models we apply can therefore be used by many users concurrently.

How many documents have been ingested into the service?

We have several industrial clients using the tools, so we don’t know how many documents they have ingested as they each have their own IBM Cloud instance. But for COVID-19 we ingested all 45,826 papers from the White House.

How has the research community reacted to using this AI tool?

Since we announced the free availability of our tool a few weeks ago, we have had more than 400 users from over a dozen countries, most of them medical doctors and professors.

Is there anything else that you would like to share about either the Corpus Conversion Service and/or how it is used in the context of COVID-19?

One of our clients is the Italian energy firm Eni, which is using our technology for the exploration of hydrocarbons, a complex and knowledge-intensive business that involves various engineering and scientific disciplines working together.

At Eni, the knowledge is based on the processing of large amounts of geological, physical and geochemical data, which is then processed into a knowledge graph. Geoscientists can then use AI to contextualize and present relevant information, which will help them to improve decision making and the identification and verification of possible alternative exploration scenarios. More specifically, for Eni this means a more realistic and precise representation of the geological model.

Thank you for this very important interview; this will save researchers untold hours. Readers who wish to learn more about the technology should visit the Corpus Conversion Service website. Researchers should visit the COVID-19 AI tool page. Please note that access to this resource will be granted only to qualified researchers.

Antoine Tardif is a Futurist who is passionate about the future of AI and robotics. He is the CEO of, and has invested in over 50 AI & blockchain projects. He is the Co-Founder of a news website focusing on digital securities, and is a founding partner of unite.AI. He is also a member of the Forbes Technology Council.