Pulling Value Out Of Data With Natural Language Processing
By Deborah Borfitz
March 20, 2019 | Natural language processing (NLP) is the preferred means for a growing number of cancer centers to turn unstructured and structured text into smart data. They do so by developing workflows that automate the capture of information useful for downstream clinical research, reducing to nanoseconds what would have taken hours to do manually. Many cancer centers are participating in a data-sharing oncology consortium to enhance access to larger datasets for research and discovery. At least two of the consortium members are using the Linguamatics I2E natural language processing engine to do the otherwise laborious extraction work.
City of Hope Comprehensive Cancer Center processes thousands of records for the purpose of automatically extracting data into discrete data elements, says Vice President of Research Informatics Samir Courdy. Admittedly, the front-end collection of specific data elements for reporting up to the consortium is still “somewhat of a manual process.”
Courdy joined City of Hope earlier this year after two decades at the University of Utah’s Huntsman Cancer Institute, where as chief research informatics officer (CRIO) he implemented Linguamatics I2E to establish NLP rules enabling more efficient data capture. He plans to create a similar workflow at City of Hope, he says. There, NLP queries will automatically abstract data from text documents, such as pathology reports, radiology reports, and physicians’ clinical notes, into disease-specific registries built around the data elements of interest to each registry’s clinical research team.
The long-term goal is to train algorithms to look for certain data points, mirroring the gold-standard collection criteria established by disease experts while recognizing the nuanced differences in semantics contained in text describing the same diagnosis, or types of tests, among other clinically relevant information.
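To make the idea of recognizing different wordings of the same diagnosis concrete, here is a minimal sketch in plain Python. It is not Linguamatics I2E and does not reflect how that product works internally. The synonym table, the function name `extract_diagnoses`, and the sample sentences are all hypothetical, and a real vocabulary would come from disease experts.

```python
import re

# Hypothetical synonym table: each canonical diagnosis maps to regular
# expressions for phrasings that might appear in free-text reports.
# These patterns are illustrative assumptions, not a clinical vocabulary.
SYNONYMS = {
    "Hodgkin lymphoma": [
        r"hodgkin['’]?s?\s+lymphoma",
        r"hodgkin['’]?s?\s+disease",
    ],
    "Acute myeloid leukemia": [
        r"acute\s+myeloid\s+leukemia",
        r"\bAML\b",
    ],
}


def extract_diagnoses(text):
    """Return the canonical diagnoses whose synonyms appear in the text."""
    found = []
    for canonical, patterns in SYNONYMS.items():
        # One matching phrasing is enough to record the canonical term.
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            found.append(canonical)
    return found


if __name__ == "__main__":
    print(extract_diagnoses("Findings consistent with Hodgkin's disease."))
    print(extract_diagnoses("Bone marrow biopsy: AML."))
```

A sketch this simple ignores negation ("no evidence of AML"), misspellings, and context, which is part of why dedicated NLP engines and expert-defined criteria are needed in practice.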
Courdy says he developed a proof of concept with Linguamatics after running across the technology at a conference eight years ago. Linguamatics was then adopted as a tool for improving the speed at which data is captured and made available to researchers. Hearing Courdy present his findings at a big data symposium in the San Francisco Bay Area, a colleague from the audience proposed the idea of an inter-institutional project involving both Huntsman Cancer Institute and City of Hope.
Courdy’s group collaboratively shared and refined their data queries for malignancies in Hodgkin’s lymphoma to help train NLP algorithms to better process and analyze data to meet the needs of individual research centers, he explains. NLP was an attractive alternative to hiring data extractors to pull information from free-text documents, a manual process that tends to introduce errors even under the best of circumstances.
I2E super users at City of Hope include informaticists, computational biologists, data scientists and statisticians, says Courdy, but the system is intuitive enough that basic scientists and lab researchers can be trained to use it. Datasets are shared with investigators in the aggregate, to protect patient-identifiable information, either in an Excel spreadsheet or a data file format that can be further analyzed. The information is segregated by categories, as defined by the research question, which could include age groups, sex, cancer staging or grading, diagnostic markers, therapeutics, and pre- and post-treatments.
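Sharing data in the aggregate, segregated by category, can be pictured with a short Python sketch. The record fields (`age`, `sex`, `stage`), the age-group cutoffs, and the function names are assumptions made for illustration. The article does not describe the actual schema or groupings used at City of Hope.

```python
from collections import Counter


def age_group(age):
    """Bucket an age into a coarse group (cutoffs are illustrative)."""
    if age < 40:
        return "<40"
    if age < 65:
        return "40-64"
    return "65+"


def aggregate(records):
    """Count patients per (age group, sex, stage) combination.

    Only the counts are returned, so no individual record is shared.
    """
    counts = Counter(
        (age_group(r["age"]), r["sex"], r["stage"]) for r in records
    )
    return dict(counts)


if __name__ == "__main__":
    # Made-up records, used only to show the shape of the output.
    sample = [
        {"age": 34, "sex": "F", "stage": "II"},
        {"age": 37, "sex": "F", "stage": "II"},
        {"age": 70, "sex": "M", "stage": "IV"},
    ]
    for key, n in aggregate(sample).items():
        print(key, n)
```

Counts like these could then be written to a spreadsheet or data file. Very small cells can still identify individuals, so a real pipeline would likely need additional safeguards that this sketch omits.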
“Natural language processing has a role to play in precision medicine because data is where value is derived from,” says Courdy. “It will help you find the answers to the questions you are asking and improve processes, outcomes, and care because you’re learning from the data you’re collecting.” While much of the “magic” happens downstream, NLP in real time also helps frame questions and, with other data science disciplines such as machine learning and artificial intelligence, can be used to predict adverse events.
Huntsman Cancer Institute, during Courdy’s time there as CRIO, developed NLP algorithms for hematologic malignancies and cancers of the prostate, breast and pancreas—as well as the symptom management component for all cancer types, he says. NLP also helped identify process improvements with manual data extraction.
In December, Courdy presented on some of this work at the American Society of Hematology’s annual meeting in San Diego. A poster on the feasibility and accuracy of extracting diagnostic data from bone marrow biopsy reports of myeloid neoplasms using an NLP algorithm is pending journal publication. A recently published case study also describes how Linguamatics I2E was used at Huntsman to collect data faster and provide its researchers with higher quality data.
All of this plays a role in identifying patients for oncology clinical trials, where recruitment has traditionally been dismally low. Given that it takes a person between eight and 12 hours to abstract one patient record—and possibly longer, depending on the patient’s journey and treatment length and complexity—the value yielded by NLP over the long term should be notable, Courdy says.
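The scale of that manual effort is easy to work out. The sketch below multiplies the eight-to-12-hour range Courdy cites by a record count. The figure of 1,000 records is an assumption chosen for illustration, not a number from City of Hope, and the function name is hypothetical.

```python
def manual_abstraction_hours(n_records, low=8, high=12):
    """Return the (minimum, maximum) person-hours to abstract n_records
    by hand, using the per-record range of 8 to 12 hours."""
    return n_records * low, n_records * high


if __name__ == "__main__":
    lo, hi = manual_abstraction_hours(1000)  # 1,000 is an assumed count
    print(f"{lo:,} to {hi:,} person-hours")
```

At that assumed volume, manual abstraction would take 8,000 to 12,000 person-hours, before accounting for the longer and more complex patient journeys Courdy mentions.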