Is Anonymized Data Secure? Not So Fast, Study Says

By Benjamin Ross

August 23, 2019 | Policies such as the European Union's General Data Protection Regulation (GDPR) have forced companies to reevaluate how they handle the personal data of their clients. In the case of clinical trials, researchers have gone to great lengths to protect their patients' data, de-identifying whole datasets through anonymization. A recent study suggests the data isn't as protected as you might think.

The study, conducted by researchers at Imperial College London and published in Nature Communications (DOI: https://doi.org/10.1038/s41467-019-10933-3), demonstrated that, using a statistical model, patients can be re-identified with a high degree of accuracy. In fact, according to the study's authors, "99.98% of Americans would be correctly re-identified in any dataset using [fifteen] demographic attributes."
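To get a feel for why a handful of attributes can pin down so many people, consider a toy simulation. This is not the study's statistical model, and every attribute, value, and proportion below is invented purely for illustration; it only shows the underlying intuition that, as more demographic attributes are combined, the number of people sharing any given combination collapses toward one.

```python
import random
from collections import Counter

# Hypothetical synthetic "population": each person is described by a few
# coarse demographic attributes. All values and proportions are made up;
# this is an illustration, not the study's model or data.
random.seed(0)

def random_person():
    return {
        "zip": random.choice([f"9{n:04d}" for n in range(50)]),  # 50 fake ZIP codes
        "birth_year": random.randint(1940, 2000),
        "birth_month": random.randint(1, 12),
        "gender": random.choice(["F", "M"]),
        "children": random.randint(0, 4),
        "vehicle": random.choice(["none", "sedan", "suv", "truck"]),
    }

population = [random_person() for _ in range(100_000)]

def share_unique(attrs):
    """Fraction of the population whose combination of `attrs` is unique."""
    combos = Counter(tuple(p[a] for a in attrs) for p in population)
    unique = sum(1 for p in population if combos[tuple(p[a] for a in attrs)] == 1)
    return unique / len(population)

# Each added attribute shrinks the "anonymity set" a person can hide in.
attributes = ["zip", "birth_year", "birth_month", "gender", "children", "vehicle"]
for k in range(1, len(attributes) + 1):
    print(f"{k} attributes -> {share_unique(attributes[:k]):.1%} of people are unique")
```

Running the sketch prints an increasing share of unique individuals as attributes are added; the real study goes further by estimating, for any given combination, how likely a match is to be correct.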

The results suggest that "even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR," the authors write.

It's a messy business, Mariya Pinskaya, principal consultant at Areva Consulting, tells Clinical Research News. Companies and researchers are still fuzzy on the details of how to handle their patients' data in light of GDPR, even a year after its implementation.

"It's difficult to understand exactly how to stay compliant," Pinskaya says. "I think companies are for the most part compliant, but they achieve that compliance in a variety of ways... [GDPR] is vague in the sense that it allows for broad compliance options." For instance, under GDPR, a company might be defined under several categories of data collection. Are they a processor of data? A controller of data?

What we're seeing is a fluctuation of these categories, depending on the specific study in question, Pinskaya says. In order to achieve "compliance," companies need to implement certain safeguards dealing mostly with how they handle data, including how they transfer and process them.

"The reality is that a lot of this is unknown," Mark Phillips, a data protection lawyer working at the Centre of Genomics and Policy at McGill University, tells Clinical Research News. GDPR, although it impacts science and clinical research, wasn't established solely with that perspective in mind, resulting in misconceptions within the clinical research community.

"[GDPR] is mainly written from the perspective of trying to regulate a social media company or or other tech outfit, who lawmakers primarily had in mind, so it's not always clear how it will apply in other specific fields," Phillips says. He adds that companies, often outside of the EU, fear GDPR's obligations are impossible to fulfill.

"For instance, [the regulation] applies to someone outside of Europe monitoring the behavior of people inside of Europe. It seems that in many cases if you're receiving study data from a European researcher you won't fall under that specific provision, but the scope of the GDPR's application outside of Europe is still pretty unclear. We're receiving some guidance from regulators, but the guidance isn't binding law, and is sometimes contradictory."

Pinskaya points out that the regulation is not punitive in nature. Instead, its intention is to establish guidelines for data protection.

"I think balance in these things is important," she says. "I don't think regulation—this one being no exception—is ever put out to eradicate a problem. A regulation is, frankly, to regulate... to guide and advise companies, whether it's pharmaceutical or others, that safeguards need to be put in place in order to make sure data is properly treated."

The Nature study authors question whether "current de-identification practices satisfy the anonymization standards of modern data protection laws such as GDPR," suggesting a "move, from a legal and regulatory perspective, beyond the de-identification release-and-forget model."

"There's an entrenched debate on this issue," says Phillips. "A lot of people get overly hung up on the notion of personal data and compliance through anonymization, and they're still clinging to anonymization as a 'cure-all'."

Phillips agrees with the study's findings, saying it "puts a final nail in the coffin" of the idea that anonymization on its own can address contemporary privacy concerns.

The article posits a scenario in which a health insurance company publishes a de-identified dataset of 1,000 people in California used to predict breast cancer. While the data are anonymized, they do include birth dates, gender, ZIP code, and breast cancer diagnosis. An employer downloads the dataset and is able to identify an employee, even though the dataset is "heavily incomplete." While the anonymization offers plausible deniability, the authors insist that, with their data model, "the likelihood of a re-identification to be correct, even in a heavily sampled dataset, can be accurately estimated, and is often high."
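A rough sketch of the matching step in that scenario might look like the snippet below. The records, field names, and values are all fabricated, and the logic is ordinary filtering on known attributes, not the statistical model the authors use; their contribution is the harder part of estimating how likely such a match is to be correct when the released data are only a sample.

```python
from datetime import date

# Fabricated, illustrative "anonymized" release: no names or IDs, but
# birth date, gender, and ZIP code remain.
released_dataset = [
    {"birth_date": date(1978, 3, 14), "gender": "F", "zip": "94107", "diagnosis": "positive"},
    {"birth_date": date(1962, 7, 2),  "gender": "M", "zip": "90012", "diagnosis": "negative"},
    {"birth_date": date(1985, 11, 9), "gender": "F", "zip": "95814", "diagnosis": "negative"},
    # ... imagine 997 more records
]

# What the employer already knows about one employee from HR files.
known_employee = {"birth_date": date(1978, 3, 14), "gender": "F", "zip": "94107"}

# Keep only the released records consistent with everything the employer knows.
matches = [
    record for record in released_dataset
    if all(record[key] == known_employee[key] for key in known_employee)
]

if len(matches) == 1:
    # A single surviving record singles the employee out and exposes the
    # sensitive attribute, even though the release carried no names.
    print("Likely re-identified; diagnosis:", matches[0]["diagnosis"])
else:
    print(f"{len(matches)} candidate records; the match is ambiguous")
```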

Phillips says these results back up trends he's been seeing in recent years.

"For a while people were saying that even if you have someone's whole genome sequence or some really rich data, you wouldn't really be able to connect the information to who the person is with only a series of base pairs. But over time we've found that as the world is becoming more and more data-rich, there are ways to achieve [re-identification]."

That's not to say the study is without its weaknesses.

"Although the [authors] use the word 're-identification' throughout the article when talking about its method, to me it looks more like what, under GDPR, we would call 'singling out.'"

While re-identification involves revealing or disclosing someone's identity, singling out involves pinpointing characteristics within the data and tracking them. Phillips says there is a clear distinction to be made.

"The distinction is drawn within GDPR itself. They say, for example, that an important factor to consider when you're trying to decide if something is personalized data or if it's anonymous is whether or not individuals can be singled out, but they don't say it's the same thing as being 'identifiable.'"

Pinskaya believes the study might have overstated the ability of a bad actor to achieve identification, essentially presenting a "doomsday scenario."

"The article talked more toward a system of bad behavior than to an individual bad behavior," she says. "I think on an individual basis for someone to sit behind a computer, take a pool of data, and be able to identify a person based on the number of criteria listed in the public domain would be fairly difficult."

With potential breaches in mind, both Pinskaya and Phillips agree there is still work to do in providing solutions for data privacy concerns.

"Our best bet is not to rely on any one 'silver bullet' solution, but to incorporate a range of strategies adapted to the situation," says Phillips. "Although they don't provide the simple answer, these more proportionate approaches are the way to go, although it is hard to generalize across an entire field of data-driven science. Each project has its own specific situation."

"There's a balance we need to strike," Pinskaya says. "We certainly need data. It's the key to everything we do [in the pharmaceutical industry]. We don't collect it for fun or to check off a box; we collect it because it's critical to proper patient care."