Scientists at the Icahn School of Medicine at Mount Sinai have developed a new, automated, artificial intelligence (AI)-based algorithm that can read and predict patient data from electronic health records (EHRs).
The new method is called Phe2vec, and it can accurately identify patients with certain diseases. It was demonstrated to be just as accurate as the most popular traditional method, which requires more manual labor to perform.
Benjamin S. Glicksberg, PhD, is Assistant Professor of Genetics and Genomic Sciences. He is also a member of the Hasso Plattner Institute for Digital Health at Mount Sinai (HPIMS) and a senior author of the study.
“There continues to be an explosion in the amount and types of data electronically stored in a patient’s medical record. Disentangling this complex web of data can be highly burdensome, thus slowing advancements in clinical research,” said Glicksberg. “In this study, we created a new method for mining data from electronic health records with machine learning that is faster and less labor intensive than the industry standard. We hope that this will be a valuable tool that will facilitate further, and less biased, research in clinical informatics.”
The study, which was published in the journal Patterns, was led by Jessica K. De Freitas, a graduate student in Dr. Glicksberg’s lab.
Current Industry Standard
Scientists currently rely on established computer programs and algorithms to mine medical records for new information. The development and storage of these algorithms is managed by a system called the Phenotype Knowledgebase (PheKB). The system is highly effective at correctly identifying a patient diagnosis, but researchers must first go through many medical records and look for pieces of data that are specific to the disease, such as certain lab tests and prescriptions.
The algorithm is then programmed to guide the computer to search for patients who have those disease-specific pieces of data, which together constitute a “phenotype.” The resulting list of patients must then be manually checked by the researchers. If the researchers want to study a new disease, they have to start the whole process over.
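To make the manual workflow concrete, here is a minimal Python sketch of what a hand-written phenotype rule might look like. It is only an illustration and not an actual PheKB algorithm: the record structure, the rule format, and every code, drug name, and threshold (DX_A, DRUG_A, LAB_A) are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    lab_results: dict = field(default_factory=dict)  # lab name -> latest value


# Hypothetical hand-written rule for one disease. Every code, drug name,
# and threshold is a placeholder, not a real clinical definition.
EXAMPLE_RULE = {
    "diagnosis_codes": {"DX_A"},
    "medications": {"DRUG_A"},
    "lab_thresholds": {"LAB_A": 6.5},
}


def matches_rule(record, rule):
    """Return True if the record satisfies the hand-written phenotype rule."""
    has_code = bool(record.diagnosis_codes & rule["diagnosis_codes"])
    has_drug = bool(record.medications & rule["medications"])
    has_lab = any(
        record.lab_results.get(lab, float("-inf")) >= cutoff
        for lab, cutoff in rule["lab_thresholds"].items()
    )
    # Require the diagnosis code plus at least one supporting piece of data.
    return has_code and (has_drug or has_lab)


def find_candidates(records, rule):
    """List the patients that match the rule and still need manual review."""
    return [r.patient_id for r in records if matches_rule(r, rule)]


records = [
    PatientRecord("p1", {"DX_A"}, {"DRUG_A"}, {"LAB_A": 7.1}),
    PatientRecord("p2", {"DX_A"}, set(), {"LAB_A": 5.0}),
    PatientRecord("p3", {"DX_B"}, {"DRUG_A"}, {}),
]
candidates = find_candidates(records, EXAMPLE_RULE)
print(candidates)
```

In this toy setup, studying a different disease would mean writing a new rule by hand, which is the repeated effort the researchers describe.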
The New Method
With the new method, the researchers let the computer learn on its own how to spot disease phenotypes, which saves the researchers time and work. The Phe2vec method builds on previous studies the team carried out.
Riccardo Miotto, PhD, is a former Assistant Professor at the HPIMS and a senior author of the study.
“Previously, we showed that unsupervised machine learning could be a highly efficient and effective strategy for mining electronic health records,” said Miotto. “The potential advantage of our approach is that it learns representations of diseases from the data itself. Therefore, the machine does much of the work experts would normally do to define the combination of data elements from health records that best describes a particular disease.”
The computer was programmed to go through millions of electronic health records and learn how to find connections between data and diseases. The programming relied on “embedding” algorithms, which had previously been developed by other researchers to study word networks in various languages.
One of those algorithms, called word2vec, was especially effective. The computer was then programmed to identify the diagnoses of around 2 million patients whose data was stored in the Mount Sinai Health System.
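The general idea behind an embedding can be sketched in a few lines of Python. This is a toy illustration, not the study's implementation: the three-dimensional vectors below are made up by hand rather than learned from records, and the concept names, the `nearest_concepts` helper, and the scoring scheme are all assumptions for the example. In an embedding, medical concepts that tend to appear together end up with similar vectors, so a disease's neighbors can stand in for a hand-written definition.

```python
import math

# Hypothetical, hand-made vectors. A real system would learn these
# from millions of records with an algorithm such as word2vec.
EMBEDDINGS = {
    "dx:dementia": [0.9, 0.1, 0.0],
    "rx:donepezil": [0.8, 0.2, 0.1],
    "lab:cognitive_screen": [0.7, 0.3, 0.0],
    "dx:sickle_cell_anemia": [0.0, 0.1, 0.9],
    "rx:hydroxyurea": [0.1, 0.0, 0.8],
    "lab:hemoglobin": [0.1, 0.3, 0.7],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest_concepts(seed, k):
    """Return the k concepts whose vectors are most similar to the seed."""
    others = [c for c in EMBEDDINGS if c != seed]
    others.sort(key=lambda c: cosine(EMBEDDINGS[seed], EMBEDDINGS[c]), reverse=True)
    return others[:k]


def mean_vector(concepts):
    """Average the vectors of a list of concepts, dimension by dimension."""
    vectors = [EMBEDDINGS[c] for c in concepts]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def patient_score(patient_concepts, seed, k=2):
    """Similarity between a patient's record and the seed disease's neighborhood."""
    phenotype = [seed] + nearest_concepts(seed, k)
    return cosine(mean_vector(patient_concepts), mean_vector(phenotype))


neighbors = nearest_concepts("dx:dementia", 2)
score_a = patient_score(["rx:donepezil", "lab:cognitive_screen"], "dx:dementia")
score_b = patient_score(["rx:hydroxyurea", "lab:hemoglobin"], "dx:dementia")
print(neighbors)
print(score_a > score_b)
```

Here the patient whose record contains dementia-related concepts scores higher for dementia than the patient whose record does not, without anyone having written a dementia rule by hand.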
The researchers then compared the effectiveness of the new and old systems. They found that for nine out of ten diseases tested, the new Phe2vec system was as effective as, or slightly better than, the current “gold standard” phenotyping process for identifying diagnoses from EHRs. The diseases tested included dementia, multiple sclerosis, and sickle cell anemia.
“Overall our results are encouraging and suggest that Phe2vec is a promising technique for large-scale phenotyping of diseases in electronic health record data,” Dr. Glicksberg said. “With further testing and refinement, we hope that it could be used to automate many of the initial steps of clinical informatics research, thus allowing scientists to focus their efforts on downstream analyses like predictive modeling.”