Why Hasn't Big Data Come to the Rescue in Clinical Data Unification?

Contributed Commentary by Timothy Danford    

June 17, 2015 | Contributed Commentary | The cost-to-value equation for standardizing clinical data is broken. Pharmaceutical companies spend millions of dollars annually, and substantially delay products’ time to market, sending clinical data to contractors for preparation and integration before analysis. In one experiment by the FDA, manually converting legacy data from about 100 new drug applications to a new standard format cost $7 million. John Keilty, General Manager at Third Rock Ventures and former Vice President of Informatics at Infinity Pharmaceuticals, recently estimated that as much as 80% of collected clinical data remains in older or legacy standards, because companies typically re-standardize only the data required for compliance.

A complete view of all available clinical data would be incredibly useful for improving clinical analytics, simplifying cross-study comparisons, speeding future trials, and data mining for new indications. Mapping new data as it is created provides some of these benefits and reduces the costs and delays associated with relying on contractors to impose order on the data after the fact. However, the tangled web of clinical data standards and vocabularies makes reorganizing the existing data a daunting challenge, when added to the task of real-time integration for new datasets.

It’s a problem tailor-made for automated big data intervention. Why is the standardization process for clinical data still manual? Modern data techniques excel at replicating small functions across vast seas of data, but this challenge is quite different. Multiple versions of standard schemas, legacy data models, custom experimental domains, and locked or proprietary formats from previous contractor-created datasets are highly resistant to mass organization efforts. Organizing data at this scale, only to return to square one when a new data standard is released, is a serious risk for any data-conscious enterprise.

“Data lakes”—a helpful strategy for other high-volume data challenges—have also been proposed as one solution to the problem of clinical data. Data lakes excel at finding data through search and co-locating data for integration. However, finding the data is not the problem, and co-locating datasets does not qualify as organizing them when it comes to clinical data integration. Automation alone can’t tackle the clinical data diversity problem.

The real pain point is the manual data conversion process that stands between data collection and analysis. Bridging it must be an ongoing investment rather than a one-time project, because even previously standardized datasets must be reorganized and reformatted when the target standards are themselves updated.

A long-term solution requires data unification capabilities paired with the ability to adapt as the standards for that data change. One of the most promising advances in this area is a hybrid approach that marries the expertise of data standards specialists and curators with big data automation, using the learning and adaptive abilities of machine learning algorithms.

These hybrid methods can operate at a scale and accuracy unprecedented among existing data integration methodologies. For example, an international pharmaceutical company spending $2 billion per year on clinical data research recently used this hybrid machine learning approach, employing automated sourcing of internal expertise to unify datasets from thousands of scientists in labs spread across the globe. The datasets targeted for integration were originally stored in tens of thousands of spreadsheets with more than 100,000 different attribute names; millions of rows; and inconsistencies in labeling, measurement units, and even language; all of it managed by 8,000 scientists with little ability or incentive to clean and share the information.

This company turned to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) who had been working on an academic project with similar goals. The engagement with the MIT team began with a proof of concept that combined machine learning techniques with a panel of data experts across the organization, curating the data without requiring too much time from any single person or group. Initially, the system automatically analyzed and matched 86% of the source attributes using only a small sample of rows. Then, using the expert-sourcing feature, 40 data scientists were recruited to further refine the matches. Over the next few weeks, the match rate increased by almost another 10%.
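To make the attribute-matching step concrete, here is a minimal, hypothetical sketch of how automated matching with an expert-review queue might work. It is not the CSAIL or Tamr system; the target vocabulary, thresholds, and column names are invented for illustration, and the similarity measure is a simple string comparison standing in for a learned model.

```python
from difflib import SequenceMatcher

# Hypothetical target standard attributes (a simplified, SDTM-like vocabulary).
TARGET_ATTRIBUTES = ["SubjectID", "VisitDate", "SystolicBP", "DiastolicBP", "LabTestName"]

AUTO_ACCEPT = 0.85   # assumed confidence threshold for automatic matching
REVIEW_FLOOR = 0.50  # below this score, the attribute is left unmatched

def best_match(source_name, targets=TARGET_ATTRIBUTES):
    """Return (similarity score in [0, 1], most similar target attribute)."""
    scored = [(SequenceMatcher(None, source_name.lower(), t.lower()).ratio(), t)
              for t in targets]
    return max(scored)

def triage(source_names):
    """Split source attributes into auto-matched, expert-review, and unmatched buckets."""
    matched, review, unmatched = {}, [], []
    for name in source_names:
        score, target = best_match(name)
        if score >= AUTO_ACCEPT:
            matched[name] = target                          # accepted automatically
        elif score >= REVIEW_FLOOR:
            review.append((name, target, round(score, 2)))  # queued for a domain expert
        else:
            unmatched.append(name)
    return matched, review, unmatched

if __name__ == "__main__":
    sources = ["subj_id", "visit_dt", "sys_bp_mmhg", "heart_rate", "labtestname"]
    auto, needs_expert, unknown = triage(sources)
    print("auto-matched:", auto)
    print("needs expert review:", needs_expert)
    print("unmatched:", unknown)
```

In a real deployment the scoring function would be a trained model over names, values, and metadata rather than string similarity, but the triage pattern of auto-accept, expert review, and unmatched is the part the commentary describes.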

While automated attribute matching is a major benefit, the project sponsors were most excited by the change in attitude of the data source experts, who began to care more about the quality and availability of their data. Expert sourcing allows data analysts to resolve conflicts efficiently by sending curation tasks to the people who produce the data. Resolving these conflicts becomes part of the workflow, enabling real-time addition of new data to the general pool.

As the source inventory grows, data scientists, line-of-business managers, and C-level executives will be able to analyze and compare global research projects and make fast, data-driven decisions that cut the time and cost of new product development.

To maintain accuracy, the hybrid data integration method employs data experts who can answer difficult questions and make complicated decisions that automated algorithms cannot. To improve efficiency, the machine learning system uses the experts’ answers to selected questions both to improve automated data unification and to decide which experts to ask related data-matching questions in the future. Organizations spend less time preparing data, even as they scale up to hundreds or thousands of sources, and more time competing on analytics.
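As an illustration of the expert-sourcing loop described above, the toy sketch below routes a low-confidence matching question to an expert and folds the answer back in as a confirmed mapping that short-circuits future automated matching. The class name, the domain hints, and the credit-based routing rule are assumptions for the example, not the actual system.

```python
from collections import defaultdict

class ExpertRouter:
    """Toy expert-sourcing loop: route a match question to a suitable expert,
    then reuse the confirmed answer in later automated matching."""

    def __init__(self, experts):
        self.experts = experts              # e.g., {"alice": "labs", "bob": "vitals"}
        self.confirmed = {}                 # source attribute -> standard attribute
        self.credit = defaultdict(int)      # expert -> number of accepted answers

    def route(self, source_attr, domain_hint):
        """Prefer experts whose domain matches the hint; break ties by track record."""
        candidates = [e for e, d in self.experts.items() if d == domain_hint]
        candidates = candidates or list(self.experts)
        return max(candidates, key=lambda e: self.credit[e])

    def record_answer(self, expert, source_attr, standard_attr):
        """Store the expert's mapping so the automated matcher can reuse it later."""
        self.confirmed[source_attr] = standard_attr
        self.credit[expert] += 1

    def lookup(self, source_attr):
        """Previously confirmed mappings bypass automated matching entirely."""
        return self.confirmed.get(source_attr)

if __name__ == "__main__":
    router = ExpertRouter({"alice": "labs", "bob": "vitals"})
    expert = router.route("sys_bp_mmhg", domain_hint="vitals")
    router.record_answer(expert, "sys_bp_mmhg", "SystolicBP")
    print(expert, "->", router.lookup("sys_bp_mmhg"))
```

The point of the pattern is that each expert answer does double duty: it resolves the immediate conflict and it becomes training signal for both the matcher and the routing of future questions.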

Over time, the automated methods learn from the answers to hard questions and apply them across the entire data integration workflow. As a result, the marginal cost of integrating new data drops, and re-standardizing existing or legacy datasets when the standards themselves change becomes faster.

The very practical savings in capital and time to market, combined with the opportunity to improve findings and explore new possibilities by using all available clinical data, give pharmaceutical companies a strong incentive to stop relying on error-prone, slow, and costly contractor-executed data transformation. Machine learning has provided a path to bring the best of human institutional knowledge to the automation of clinical data integration, offering better accuracy and speed at lower cost.

Timothy Danford is a field engineer for Tamr working on advanced automation approaches to Big Data Variety in the pharmaceutical and healthcare industries. He earned his PhD in computer science from MIT.