Open Data Set on Covid-19 Released for Machine Learning

Published March 17, 2020

Updated April 26, 2026

Alex McFarland

The White House’s Office of Science and Technology Policy is asking researchers to analyze thousands of scholarly articles with artificial intelligence (AI) technology. The articles, which number around 29,000, could provide answers to questions about the coronavirus. About 13,000 of the articles in the database are available in full and are machine-readable. For the other 16,000 articles, the database has partial text and summaries.

Over the last few days, U.S. government officials have worked with American tech companies and research institutions to secure legal permission to make the coronavirus papers available.

The open data set is known as the COVID-19 Open Research Dataset, or CORD-19. New information will be added to it continually, giving researchers and others one centralized hub to access it.

The partnership announced by the White House includes the Chan Zuckerberg Initiative, Microsoft Research, the Allen Institute for Artificial Intelligence, the National Institutes of Health’s National Library of Medicine, Georgetown University’s Center for Security and Emerging Technology, Cold Spring Harbor Laboratory and the Kaggle AI platform, which is owned by Google.

According to U.S. CTO Michael Kratsios, the CORD-19 dataset is the “most extensive collection of machine readable coronavirus literature to date.”

The National Academies of Sciences, Engineering, and Medicine worked with the World Health Organization (WHO) to develop “high priority” questions. These questions revolve around the relationship between coronavirus and genetics, incubation, treatment, symptoms, and prevention.
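As a rough illustration of how a researcher might start sorting articles by these question areas, here is a minimal Python sketch. It is not the method used by any CORD-19 partner. The keyword lists, the record fields (`title`, `abstract`), and the sample records are all assumptions made up for this example, and simple keyword matching is only a stand-in for the more capable text-mining tools the call to action has in mind.

```python
from collections import defaultdict

# Hypothetical keyword lists for the five topic areas named in the article.
# A real project would need far richer vocabularies or a trained model.
TOPIC_KEYWORDS = {
    "genetics": ["genome", "genetic", "mutation"],
    "incubation": ["incubation"],
    "treatment": ["treatment", "therapy", "antiviral"],
    "symptoms": ["symptom", "fever", "cough"],
    "prevention": ["prevention", "vaccine", "quarantine"],
}


def tag_topics(text):
    """Return the sorted topic names whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(word in lowered for word in words)
    )


def group_by_topic(records):
    """Map each topic to the titles of records whose abstract mentions it."""
    groups = defaultdict(list)
    for record in records:
        for topic in tag_topics(record["abstract"]):
            groups[topic].append(record["title"])
    return dict(groups)


# Made-up records for illustration only; these are not real CORD-19 entries.
SAMPLE = [
    {"title": "Paper A",
     "abstract": "Estimated incubation period and common symptoms such as fever."},
    {"title": "Paper B",
     "abstract": "Antiviral therapy outcomes in hospitalized patients."},
    {"title": "Paper C",
     "abstract": "Genome sequencing reveals a mutation in the spike protein."},
]

if __name__ == "__main__":
    for topic, titles in sorted(group_by_topic(SAMPLE).items()):
        print(topic, titles)
```

Substring matching like this is crude: it misses synonyms and can match unrelated words. That is part of why the organizers are asking for machine learning approaches rather than simple search.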

Some of the research in the database is pre-publication work pulled from open-access archives such as medRxiv and bioRxiv.

Cori Bargmann is the Chan Zuckerberg Initiative Head of Science.

“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” Bargmann said.

According to the Call to Action released by the White House, the database collection was developed through the use of Microsoft’s web-scale literature curation tools, which identified and brought together different scientific work from around the globe. The Chan Zuckerberg Initiative provided access to pre-publication content, the National Library of Medicine provided access to literature content, and the Allen AI team formatted the content so that it could be analyzed.

Dr. Eric Horvitz is Chief Scientific Officer at Microsoft.

“It’s all-hands on deck as we face the COVID-19 pandemic,” said Horvitz. “We need to come together as companies, governments, and scientists and work to bring our best technologies to bear across biomedicine, epidemiology, AI, and other sciences. The COVID-19 literature resource and challenge will stimulate efforts that can accelerate the path to solutions on COVID-19.”

Many are hoping that this approach works and provides a new way to utilize AI technology and machine learning in the future. One of those people is Dr. Dewey Murdick, Director of Data Science at Georgetown University’s Center for Security and Emerging Technology. Dr. Murdick helped coordinate the project.

“This valuable new resource is the fruit of unselfish collaboration and now offers the opportunity to find answers to important questions about COVID-19,” Dr. Murdick said. “Once the crisis has passed, we hope this project will inspire new ways to use machine learning to advance scientific research.”

If this project succeeds in providing much-needed answers about the coronavirus, it could be used as a model in the future. AI technology is a powerful tool, and it can analyze the work of experts and institutions around the globe much faster than humans can. This means a faster response whenever a pandemic or other crisis breaks out, which could save many lives and prevent economic turmoil.

Alex McFarland is an AI journalist and writer exploring the latest developments in artificial intelligence. He has collaborated with numerous AI startups and publications worldwide.

Unite.AI
