IBM scientist Peter Staar has developed an AI tool which is being used by more than 300 experts who are developing a treatment or vaccination for COVID-19.
To help researchers access structured and unstructured data quickly, IBM is offering a cloud-based AI research resource that has been trained on a corpus of more than 45,000 scientific papers contained in the COVID-19 Open Research Dataset (CORD-19), prepared by the White House and a coalition of research groups, and on licensed databases from DrugBank, Clinicaltrials.gov and GenBank.
Dr. Peter Staar joined the IBM Research – Zurich Laboratory in July of 2015 as a post-doctoral research fellow in the Foundations of Cognitive Solutions project. The Belgium-born scientist first came to IBM Research as a summer student in 2006.
You first joined the IBM Research – Zurich Laboratory in July of 2015. What types of projects have you worked on at IBM?
My initial research focused on applications for high performance computing, and I was part of the winning team for the prestigious ACM Gordon Bell award.
More recently, around 2017, I started to focus on AI, and in August 2018 my team published a paper at the ACM Conference on Knowledge Discovery and Data Mining (KDD 2018) on a massively scalable document ingestion system, which we called the Corpus Conversion Service. This AI-based cloud tool was able to ingest 100,000 PDF pages per day (even of scanned documents) with accuracy above 97 percent, and then train and apply advanced machine learning models that extract the content from these documents at a scale never achieved before. We are now applying this same technology to help researchers with COVID-19.
When did IBM first come across the idea of using Corpus Conversion Service to tackle the COVID-19 epidemic?
In mid-March the White House led an effort to publish more than 45,000 documents on the coronavirus and COVID-19. When we saw the corpus we quickly realised that our technology could help, not just to make the PDFs searchable, but also to combine the knowledge within those PDFs with additional datasets like DrugBank, GenBank and Clinicaltrials.gov. We went live with the service on 3 April.
How would you best describe what the Corpus Conversion Service is?
As with any large volume of disparate data sources, it is difficult to efficiently aggregate and analyse that data in ways that can yield scientific insights. We make this easier using a knowledge graph which finds connections between these data sources to potentially yield new knowledge.
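The idea of a knowledge graph that finds connections between data sources can be illustrated with a minimal sketch. This is not the Corpus Conversion Service's implementation, which the interview does not describe; it is a toy example that assumes records from papers, drug databases, gene databases and trial registries have already been reduced to pairs of linked entities. All identifiers (`paper:P1`, `drug:D1`, and so on) and the function names are hypothetical placeholders, and the search used is a plain breadth-first search.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: return the nodes on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical placeholder links, one per source type (not real data).
EDGES = [
    ("paper:P1", "drug:D1"),   # a paper mentions a drug
    ("drug:D1", "trial:T1"),   # a drug record points to a clinical trial
    ("paper:P2", "gene:G1"),   # another paper mentions a gene sequence
    ("gene:G1", "drug:D1"),    # a gene record is linked to the same drug
]

graph = build_graph(EDGES)
connection = shortest_path(graph, "paper:P2", "trial:T1")
print(connection)
# → ['paper:P2', 'gene:G1', 'drug:D1', 'trial:T1']
```

In this sketch, a paper and a trial that never reference each other directly become connected through the gene and drug records they share, which is the kind of cross-source link the answer above refers to.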
Can you discuss the principal challenge of extracting data from PDF format into a searchable form?
According to Adobe, there are roughly 2.5 trillion Portable Document Format (PDF) files currently in circulation. Think of the knowledge these files contain: scientific articles, technical literature, and much more. But all that content is “dark” or unused, because until now, we have had no way to ingest large numbers of PDF files at scale and make their content useable (or structured).
PDF files often include combinations of vector graphics, text, and bitmap graphics, all of which make extraction of qualitative and quantitative data quite challenging. In fact, automatic content reconstruction has been a problem for over a decade. While many document conversion solutions are available, none of them address scalability or apply AI, which means that they need to rely on expensive human-based maintenance and upgrading.
To the best of our knowledge, the Corpus Conversion Service is the first comprehensive system to use advanced AI at this level of scalability. While existing solutions can only convert one document at a time to a desired output format, our tool can ingest entire collections, a corpus of documents, and build machine learned models on top of that.
How do you extract not only the text that is contained in a document but the structure?
A key element is that we designed the human-computer interaction in the system to allow very fast and massive annotation without any computer science knowledge. This swap to machine learning gives our service a great deal of flexibility, as it can adapt rapidly to certain templates of documents, achieve highly accurate results, and ultimately eliminate the costly and time-consuming tuning typical of traditional rule-based algorithms.
Can you discuss the challenges of building a machine learning model that can scale and respond quickly to hundreds and even potentially thousands of concurrent users?
We have developed the Corpus Conversion Service on top of state-of-the-art cloud services, such as OpenShift on IBM Cloud. This allows us to scale our application effortlessly with increased demand. The AI models we apply can therefore be used by many users concurrently.
How many documents have been ingested into the service?
We have several industrial clients using the tools, so we don’t know how many documents they have ingested as they each have their own IBM Cloud instance. But for COVID-19 we ingested all 45,826 papers from the White House.
How has the research community reacted to using this AI tool?
Since we announced the free availability of our tool a few weeks ago, we have gained more than 400 users from over a dozen countries, most of them medical doctors and professors.
Is there anything else that you would like to share about either the Corpus Conversion Service and/or how it is used in the context of COVID-19?
One of our clients is Italian energy firm Eni, which is using our technology for the exploration of hydrocarbons, a complex and knowledge-intensive business that involves various engineering and scientific disciplines working together.
At Eni, the knowledge is based on the processing of large amounts of geological, physical and geochemical data, which is then processed into a knowledge graph. Geoscientists can then use AI to contextualize and present relevant information, which will help them to improve decision making and the identification and verification of possible alternative exploration scenarios. More specifically, for Eni this means a more realistic and precise representation of the geological model.
Thank you for this very important interview; this tool will save researchers untold hours. Readers who wish to learn more about the technology should visit the Corpus Conversion Service website. Researchers should visit the COVID-19 AI tool page. Please note, access to this resource will be granted only to qualified researchers.
U.S. National Institutes of Health Turns to AI for Fight Against COVID-19
The National Institutes of Health has turned to artificial intelligence (AI) for diagnosis, treatment, and monitoring of COVID-19 through the creation of the Medical Imaging and Data Resource Center (MIDRC).
What is the MIDRC?
The MIDRC consists of multiple institutions working together, led by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), which is part of NIH. The collaboration aims to develop new technologies that will help physicians detect the virus early and create personalized therapies for patients.
Bruce J. Tromberg, Ph.D., is Director of the NIBIB.
“This program is particularly exciting because it will give us new ways to rapidly turn scientific findings into practical imaging tools that benefit COVID-19 patients,” Tromberg said. “It unites leaders in medical imaging and artificial intelligence from academia, professional societies, industry, and government to take on this important challenge.”
One of the ways experts assess the severity of a COVID-19 case is by looking at the features of infected lungs and hearts on medical images. This can also help predict how a patient will respond to treatment and improve the overall outcomes.
The big challenge surrounding this method is that it’s difficult to rapidly and accurately identify these signatures and evaluate the information, especially when there are other clinical symptoms and tests.
Machine Learning Algorithms
The MIDRC aims to develop and implement new and effective diagnostics. One of these will be machine learning algorithms, which solve some of those issues. Machine learning algorithms can help physicians optimize treatment after accurately and rapidly assessing the disease.
Guoying Liu, Ph.D., is the NIBIB scientific program lead on the new approach.
“This effort will gather a large repository of COVID-19 chest images,” Liu explained, “allowing researchers to evaluate both lung and cardiac tissue data, ask critical research questions, and develop predictive COVID-19 imaging signatures that can be delivered to healthcare providers.”
Krishna Kandarpa, M.D., Ph.D., is director of research sciences and strategic directions at NIBIB.
“This major initiative responds to the international imaging community’s expressed unmet need for a secure technological network to enable the development and ethical application of artificial intelligence to make the best medical decisions for COVID-19 patients,” Kandarpa said. “Eventually, the approaches developed could benefit other conditions as well.”
Other major names on the project include Maryellen L. Giger, Ph.D., Professor of Radiology, Committee on Medical Physics at the University of Chicago, who is taking the lead. Co-investigators include Etta Pisano, MD, and Michael Tilkin, MS, from the American College of Radiology (ACR); Curtis Langlotz, MD, Ph.D., and Adam Flanders, MD, from the Radiological Society of North America (RSNA); and Paul Kinahan, Ph.D., from the American Association of Physicists in Medicine (AAPM).
Through collaborations between the ACR, RSNA, and AAPM, the MIDRC will work toward rapid collection, analysis, and dissemination of imaging and other clinical data.
Many believe that the adoption of AI for pandemic-related solutions is long overdue, and the National Institutes of Health’s new MIDRC is a step in that direction. It is only a matter of time before AI plays a major role in the detection, response, and eventual prevention of pandemic-causing viruses.
Supply Chains after Covid-19: How Autonomous Solutions are Changing the Game
Early measures by the material handling industry to curb the coronavirus pandemic saw border and plant closures all over the world. While for machine and vehicle manufacturers in eastern Europe and China production is in full swing again, the rest of Europe, North America and other western countries are struggling to get back to their pre-Covid-19 production strength.
Restrictions in freight transport across Europe are still very noticeable and are causing bottlenecks in supply chains. The strict stay-at-home-orders imposed in most European countries to contain the pandemic have had and are having a major impact on industrial production as the personnel are simply missing on site.
Safety measures such as keeping a minimum distance or wearing masks are proving to be an organizational challenge for many production facilities around the world. In order to comply with the safety requirements, many premises allow only half of the workforce on-site, or divide the production line into shifts. This in turn restricts the flow of goods. Even when components are available, they pile up and cannot be integrated because of a lack of staff, or a lack of time among those on reduced activity.
After the crisis, the industry will face new challenges. There is already speculation about a trend moving away from globalization towards regionalization. It is not necessarily the sourcing of production that could be affected by a possible regionalization, but rather warehouse management. Regardless of restricted supply chains, access to material inventory is essential for every production line. As a lesson learned from the Covid-19 crisis, we could see a move from large central warehouses to smaller regional warehouses.
The automotive industry, for instance, was hit hard by supply shortages due to restrictions stemming from the pandemic. Automotive OEMs and their suppliers have long and complex supply chains with many steps in the production process. After the experienced bottlenecks, their follow-up measures might include a diversification of suppliers, as well as the decentralization of inventories in order to maintain agility in case of a crisis.
This presupposes digitalization of warehouse management: if existing stockpiling data is used rationally, transparency in the entire supply chain can easily be created. This would mean everyone involved could use existing data to optimize their processes. This requires intelligent warehouse management systems (WMS) and intelligent solutions for material handling to work hand-in-hand.
Automated guided vehicles (AGVs) are not a novelty in in-house material handling processes but their evolution could hold the key to the industry’s future. Since their introduction, technologies in autonomous vehicles have developed rapidly, enabling the transport of people in complex environments. Bringing this level of intelligence to industrial vehicles hails the next era of logistics automation: new AGV generations accessing complex outdoor environments are a real game changer and could potentially become more attractive after the Covid-19 crisis. As these vehicles become increasingly deployed in dynamic environments without infrastructure, these technologies have quickly migrated from manufacturing applications to supporting warehousing for manufacturing and distribution.
The process automation in supply chains – part of the so-called Industry 4.0 – will play a significant role. It could allow companies to keep or even reduce overall logistics operational costs, and eventually maintain a minimal operational flow even in times of crisis.
Rethinking the industrial supply chain: intelligence is key
The autonomous tow tractor TractEasy by autonomous technology leader EasyMile is a perfect example of this new generation. It masters the automation of outdoor and intralogistics processes on factory premises, logistics centers and airports. The company is currently demonstrating the maturity of these autonomous tow tractors at automaker Peugeot Société Anonyme (PSA)’s manufacturing plant in Sochaux, France. The tractor is operated by GEODIS, and PSA is using it to find opportunities to optimize the cost of the flows on its site.
The impact of the ongoing crisis has revealed the fragility of existing supply chains. Companies are reassessing large and complex procurement networks. Ultimately, the Covid-19 pandemic is putting supply chains to the test, but global supply chains should be prepared for crises as part of risk management anyway. The sheer number of natural disasters in recent years has meant that international supply chains have been repeatedly overhauled. From this point of view, the Covid-19 crisis is an example of the unpredictability that supply chains have to adapt to in order to develop.
What is certain is that the industry is on an upward trend toward more sustainable and stable industrial ecosystems. Automation is a concept that will play a major role in these future considerations, from manufacturers to logistic operators across the globe.
Stefano Pacifico, and David Heeger, Co-Founders of Epistemic AI – Interview Series
Epistemic AI employs state-of-the-art Natural Language Processing (NLP), machine learning and deep learning algorithms to map relations among a growing body of biomedical knowledge, from multiple public and private sources, including text documents and databases. Through a process of Knowledge Mapping, users work interactively with the platform to map and understand subsets of biomedical knowledge, revealing concepts and relationships that are otherwise missed with traditional search.
We interviewed both Co-Founders of Epistemic AI to discuss these latest advances.
Stefano Pacifico comes from 10+ years in applied AI and NLP development. He spent 7 years at Bloomberg and was at Elemental Cognition before starting Epistemic.
David Heeger is a Silver Professor of data science and neuroscience at NYU, and has spent his career bridging computer science, AI and bioscience. He is a member of the National Academy of Sciences. As founders they bring together the expertise of building applied large-scale AI and NLP systems for understanding large collections of knowledge, with expertise in computational biology and biomedical science from years of research in the area.
What is it that introduced and attracted you to AI and Natural Language Processing (NLP)?
Stefano Pacifico: When I was in college in Rome, and AI was not popular at all (in fact it was very fringe), I asked my then advisor which specialization I should take among those available. He said: “If you want to make money, Software Engineering and Databases, but if you want to be weird but very advanced, then choose Artificial Intelligence”. I was sold at “weird”. I then started working on knowledge representation and reasoning to study how autonomous agents could play soccer or rescue people. Then two realizations made me fall in love with NLP: first, autonomous agents might have to communicate with natural language among themselves! Second, building formal knowledge bases by hand is hard, while natural language (in text) already provides the largest knowledge base of all. I know today these might seem obvious observations, but they were not as mainstream before.
What was the inspiration behind launching Epistemic AI?
Stefano Pacifico: I am going to make a bold claim. Nobody today has adequate tooling to understand and connect the knowledge present in large, ever-growing collections of documents and data. I had previously worked on that problem in the world of finance. Think of news, financial statements, pricing data, corporate actions, filings etc. I found that problem intoxicating. And of course, it’s a difficult problem; and an important one! When I met my co-founder, Dr. David Heeger, we spent quite a bit of time evaluating startup opportunities in the biomedical industry. When we realized the sheer volume of information generated in this field, it’s as if everything fell in its right place. Biomedical researchers struggle with information overload, while attempting to grapple with the vast and rapidly expanding base of biomedical knowledge, including documents (e.g., papers, patents, clinical trials) and databases (e.g., genes, proteins, pathways, drugs, diseases, medical terms). This is a major pain point for researchers and, with no appropriate solution available, they are forced to use basic search tools (PubMed and Google Scholar) and explore manually-curated databases. These tools are suitable for finding documents matching keywords (e.g., a single gene or a published journal paper), but not for acquiring comprehensive knowledge about a topic area or subdomain (e.g., COVID-19), or for interpreting the results of high throughput biology experiments, such as gene sequencing, protein expression, or screening chemical compounds. We started Epistemic AI with the idea to address this problem with a platform that allows them to iteratively:
- Shorten the time to gather information and build comprehensive knowledge maps;
- Surface cross-disciplinary information that can otherwise be difficult to find (real discoveries often come from looking into the white space between disciplines);
- Identify causal hypotheses by finding paths and missing links in their knowledge map.
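To make the last point concrete, here is a minimal sketch of one simple way to surface "missing links" in a knowledge map. It is not Epistemic AI's method, which the interview does not detail; it uses a basic common-neighbor heuristic, which ranks pairs of entities that are not yet linked by how many neighbors they share. The entity names and the function `missing_link_candidates` are hypothetical, and the data is a toy placeholder rather than real biomedical knowledge.

```python
from itertools import combinations


def missing_link_candidates(edges):
    """Rank unlinked node pairs by the number of neighbors they share."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already linked, so not a missing link
        shared = neighbors[a] & neighbors[b]
        if shared:
            scored.append((len(shared), a, b, sorted(shared)))

    # Highest shared-neighbor count first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


# Hypothetical toy knowledge map (placeholder identifiers, not real data).
KNOWLEDGE_MAP = [
    ("gene:G1", "pathway:W1"),
    ("gene:G1", "pathway:W2"),
    ("disease:X", "pathway:W1"),
    ("disease:X", "pathway:W2"),
    ("drug:D1", "pathway:W1"),
]

candidates = missing_link_candidates(KNOWLEDGE_MAP)
for score, a, b, shared in candidates:
    print(f"{a} <-> {b}: {score} shared neighbor(s) via {shared}")
```

In this toy map the top candidate is the pair `disease:X` and `gene:G1`, which share two pathways but have no direct link. A researcher would treat such a pair as a hypothesis to check against the literature, not as a finding.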
What are some of both the public and private sources that are used to map these relations?
Stefano Pacifico: At this time, we are ingesting all the publicly available sources that we can get our hands on, including Pubmed and clinicaltrials.gov. We ingest databases of genes, drugs, diseases and their interactions. We also include private data sources for select clients, but we are not at liberty to disclose any details yet.
What type of machine learning technologies are used for the knowledge mapping?
Stefano Pacifico: One of the deeply held beliefs at Epistemic AI is that zealotry is not helpful for building products. Building an architecture integrating several machine learning techniques was a decision made early on. Those techniques range from Knowledge Representation to Transformer models, through graph embeddings, but also include simpler models like regressions and random forests. Each component is as simple as it needs to be, but no simpler. While we believe we have already built NLP components that are state-of-the-art for certain tasks, we don’t shy away from simpler baseline models when possible.
Can you name some of the companies, non-profits, or academic institutions that are using the Epistemic platform?
Stefano Pacifico: While I’d love to, we have not agreed with our users to do so. I can say that we had people signing up from very high-profile institutions in all three segments (companies, non-profits, and academic institutions). Additionally, we intend to keep the platform free for academic/non-profit purposes.
How does Epistemic assist researchers in identifying central nervous system (CNS) and other disease-specific biomarkers?
Dr. David Heeger: Neuroscience is a very highly interdisciplinary field including molecular and cellular biology and genomics, but also psychology, chemistry, and principles of physics, engineering, and mathematics. It’s so broad that nobody can be an expert at all of it. Researchers at academic institutions and pharma/biotech companies are forced to specialize. But we know that the important insights are interdisciplinary, combining knowledge from the sub-specialties. The AI-powered software platform that we’re building enables everyone to be much more interdisciplinary, to see the connections between their individual subarea of expertise and other topics, and to identify new hypotheses. This is especially important in neuroscience because it is such a highly interdisciplinary field to begin with. The function and dysfunction of the human brain is the most difficult problem that science has ever faced. We are on a mission to change the way that biomedical scientists work and even how they think.
Epistemic also enables the discovery of genetic mechanisms of CNS disorders. Can you walk us through how this works?
Dr. David Heeger: Most neurological diseases, psychiatric illnesses, and developmental disorders do not have a simple explanation in terms of genetic differences. There are a handful of syndromic disorders for which a specific mutation is known to cause the disorder. But that’s not typically the case. There are hundreds of genetic differences, for example, that have been associated with autism spectrum disorders (ASD). There is some understanding for some of these genes about the functions they serve in terms of basic biology. For example, some of the genes associated with ASD hold synapses together in the brain (note, however, that the same genes typically perform different functions in other organ systems in the body). But there’s very little understanding about how these genetic differences can explain the complex suite of behavioral differences exhibited by individuals with ASD. To make matters worse, two individuals with the same genetic difference may have completely different outcomes, one diagnosed with ASD and the other, not. And two individuals with completely different genetic profiles may have the same outcome with very similar behavioral deficits. To understand all this requires making the connection from genomics and molecular biology to cellular neuroscience (how do the genetic differences cause individual neurons to function differently) and then to systems neuroscience (how do those differences in cellular function cause networks of large numbers of interconnected neurons to function differently) and then to psychology (how do those differences in neural network function cause differences in cognition, emotion, and behavior). And all of this needs to be understood from a developmental perspective. A genetic difference may cause a deficit in a particular aspect of neural function. But the brain doesn’t just sit there and take it. Brains are highly adaptive. 
If there’s a missing or broken mechanism then the brain will develop differently to compensate as much as possible. This compensation might be molecular, for example, upregulating another synaptic receptor to replace the function of a broken synaptic receptor. Or the compensation might be behavioral. The end result depends not only on the initial genetic difference but also on the various attempts to compensate relying on other molecular, cellular, circuit, systems, and behavioral mechanisms.
No individual has the knowledge to understand all this. We all need help. The AI-powered software platform that we’re building enables everyone to collect and link all the relevant biomedical knowledge, to see the connections and to identify new hypotheses.
How are biopharma and academic institutions using Epistemic to tackle the COVID-19 challenge?
Stefano Pacifico: We have released a public version of our platform that includes COVID specific datasets and is freely accessible to anyone doing research on COVID-19. It is available at https://covid.epistemic.ai
What are some of the other diseases or genetic issues that Epistemic has been used for?
Stefano Pacifico: We have collaborated with autism researchers and are most recently putting together a new research effort for Cystic Fibrosis. But we are happy to collaborate with any other researchers or institutions that might need help with their research.
Is there anything else that you would like to share about Epistemic?
Stefano Pacifico: We are building a movement of people that want to change the way biomedical researchers work and think. We sincerely hope that many of your readers will want to join us!
Thank you both for taking the time to answer our questions. Readers who wish to learn more should visit Epistemic AI.