Does Data Anonymization Really Hide Your Identity?
Despite firms’ attempts to anonymize user data, identities can still be detected
Data is all around us, to the point where it can be unnerving.
But Jiaming Xu, assistant professor in decision sciences at Duke University’s Fuqua School of Business, stressed that although many people are concerned about data privacy, they also experience many benefits from the ways their data is used.
Predictive analytics can help people find what they’re looking for faster, Xu said. In health care, algorithms can help identify patients at risk for certain diseases, or which patients are most likely to benefit from a certain drug or treatment, he said.
However, even as firms attempt to protect users by anonymizing their data, there is the possibility their identities could be discovered. In a live discussion on Fuqua’s LinkedIn page, Xu explained some of his research on network data privacy and how easily information can be traced back to individual users.
“Due to the rapid developments of machine learning and data science, companies and individuals rely more and more on data to solve decision problems in businesses,” Xu said. “However, it turns out that collecting and disseminating a large amount of data in bulk can potentially expose customers to serious privacy breaches.”
He pointed to a 2021 incident in which anonymized data for as many as 700 million LinkedIn users was scraped from a public dataset and could be used to reveal personal data such as user names, phone numbers and email addresses.
Data anonymization and signatures
Even after standard anonymization and sanitization, in which a user’s identity is removed and some of their activity is redacted, users still have unique patterns of behavior, and they can be re-identified from these signatures, Xu said.
A contest launched by the entertainment company Netflix offers a famous example of how easily users can be re-identified from information that doesn’t include their names, he said. In 2006, Netflix challenged contestants to create an algorithm that was 10 percent better than its current algorithm in predicting users’ movie ratings.
Here, the data was anonymized, but when compared with data from the popular movie-rating website IMDb.com, researchers could identify users based on how they rated the same movies across both websites, Xu said. Netflix ended up halting the challenge due to these privacy concerns.
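The matching idea can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the researchers’ actual method: the records, names and the simple "count identical ratings" score are all made up for illustration.

```python
def similarity(anon_ratings, public_ratings):
    """Count the movies that both records rated with the same score."""
    shared = set(anon_ratings) & set(public_ratings)
    return sum(1 for movie in shared if anon_ratings[movie] == public_ratings[movie])


def link_records(anonymized, public):
    """Map each anonymized ID to the public name with the most similar ratings."""
    matches = {}
    for anon_id, anon_ratings in anonymized.items():
        matches[anon_id] = max(
            public, key=lambda name: similarity(anon_ratings, public[name])
        )
    return matches


# Hypothetical data: names removed in one dataset, visible in the other.
anonymized = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    "user_42": {"Movie A": 1, "Movie D": 5, "Movie E": 3},
}
public = {
    "alice": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie F": 3},
    "bob": {"Movie A": 1, "Movie D": 5},
}

matches = link_records(anonymized, public)
print(matches)
```

Even though the anonymized records carry no names, the overlap in ratings is enough to tie each one to a public profile in this toy case.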
Another example of a signature is the set of friends or connections a person has on social networks such as Facebook and LinkedIn. Two accounts on different social networks that share many of the same friends may actually belong to the same person, Xu explained.
This was the basis for a well-known case in which researchers were able to use this signature to correctly identify about 30 percent of people with profiles on Twitter and the social network Flickr based on their direct connections, Xu noted.
However, simply comparing a person’s popularity, or number of connections, may not be the most reliable method, Xu said. In his own research, Xu suggests this technique could go even further in confirming a person’s identity by not only looking at their direct connections, but also examining the popularity of those direct connections. Using this method could re-identify users from anonymized data even if the connections across a person’s social media networks vary by up to 11 percent, Xu’s research found.
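A rough Python sketch of the idea of looking at the popularity of a person’s connections follows. The article does not describe Xu’s algorithm in detail, so this is only an assumed toy version: each account is summarized by the degrees of its direct connections, and accounts are paired by how closely those summaries agree. The two graphs here are hypothetical and identical in structure; handling noisy, partly differing networks would need a more careful procedure.

```python
from collections import Counter


def degree_profile(graph, node):
    """Multiset of the degrees (popularity) of a node's direct connections."""
    return Counter(len(graph[neighbor]) for neighbor in graph[node])


def profile_distance(p, q):
    """Number of neighbor degrees present in one profile but not the other."""
    return sum(((p - q) + (q - p)).values())


def match_nodes(anon_graph, public_graph):
    """Pair each anonymized node with the public node whose profile is closest."""
    matches = {}
    for anon in anon_graph:
        anon_profile = degree_profile(anon_graph, anon)
        matches[anon] = min(
            public_graph,
            key=lambda name: profile_distance(
                anon_profile, degree_profile(public_graph, name)
            ),
        )
    return matches


# Hypothetical public network with names attached.
public_graph = {
    "ann": ["bo", "cy", "di"],
    "bo": ["ann", "cy"],
    "cy": ["ann", "bo", "ed"],
    "di": ["ann"],
    "ed": ["cy", "fay"],
    "fay": ["ed"],
}
# The same network with names replaced by arbitrary IDs.
anon_graph = {
    "u1": ["u3", "u6"],
    "u2": ["u3"],
    "u3": ["u1", "u6", "u2"],
    "u4": ["u5"],
    "u5": ["u6", "u4"],
    "u6": ["u3", "u1", "u5"],
}

node_matches = match_nodes(anon_graph, public_graph)
print(node_matches)
```

In this example several accounts have the same number of connections, so popularity alone cannot tell them apart, but the popularity of their connections can.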
What’s next in data privacy?
There are measures consumers can take to protect their identities, such as being cognizant of the information they post across platforms or using pseudonyms, Xu said. But because the nuances of data privacy are hard for consumers to understand, the onus may truly be on companies to develop better anonymization schemes.
While most of the scenarios Xu discussed concern network privacy specifically, companies such as Apple and Google are trying new methods to protect user data, including differential privacy, he said. With this strategy, firms may share information about groups in a dataset rather than releasing information about any individuals.
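One textbook way to release a group-level statistic under differential privacy is the Laplace mechanism, sketched below in Python. The article does not say which mechanisms Apple or Google use, so this is a generic illustration; the dataset, the function names and the privacy parameter `epsilon` are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable
ages = [23, 35, 41, 29, 52, 38, 47, 31]  # hypothetical user data

# How many users are 40 or older? The released answer is noisy.
noisy_count = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy_count)
```

A smaller `epsilon` adds more noise and gives stronger protection to each individual, at the cost of a less accurate group statistic.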
Yet another strategy is to nip the problem in the bud and avoid collecting user data altogether, Xu said.
“Privacy is leaked because companies are collecting data from users,” he said. “So the natural question is then, what if companies can still achieve their data analytics goals without collecting users’ data?”
This is called federated learning, Xu explained – a way to analyze behavior without actually collecting user data.
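A minimal Python sketch of one common form of federated learning, federated averaging, is shown below. It is an assumed toy setup rather than anything described in the article: three hypothetical devices each hold private (x, y) pairs, fit a one-parameter model y = w * x locally, and send only the updated weight to the server.

```python
def local_update(w, data, lr=0.01, steps=20):
    """Run a few gradient-descent steps on one device's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """Average the locally updated weights, weighted by each client's data size.

    The server only ever sees the weights, never the raw (x, y) records.
    """
    total = sum(len(data) for data in clients)
    return sum(local_update(w_global, data) * len(data) for data in clients) / total


# Hypothetical private datasets, all generated from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0), (0.5, 1.0)],
]

w = 0.0
for _ in range(10):
    w = federated_round(w, clients)
print(round(w, 3))
```

The shared model ends up close to the true slope of 2 even though no device ever uploads its underlying records.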
Ultimately, firms have incentives to develop better ways of protecting consumer privacy, he said.
“If you are the manager of a business, if you can develop some better anonymization schemes, this actually gives you some competitive advantage over other companies, in the sense that customers are more willing to share their information with you,” Xu said. “So then, this can potentially mean an opportunity for your business to grow.”