You’d better hope so.


Ann Cavoukian knows a thing or two about privacy. For 17 years, she served as Ontario’s provincial Information and Privacy Commissioner and was outspoken on a variety of topics, ranging from government surveillance powers to online ad tracking. Now, as executive director of the Privacy and Big Data Institute at Ryerson University, she’s got as much to say as ever about how privacy can survive amid a seismic shift in information architecture.

Cavoukian, who will deliver a keynote speech at SecTor next month, has turned her eye toward two related emerging technologies: Big Data and the Internet of Things.

The Internet is becoming a constellation of billions of devices, many of which will work autonomously, gather data from their surroundings, and communicate in peer networks. Big data capabilities will be necessary to understand all of the information that they generate, and to realize real-world benefits.

Big data and the IoT may offer huge benefits, but concerns about the privacy implications are already emerging. Data streams are becoming fatter and faster. Software is connecting disparate data sources and enabling organizations to draw new conclusions about people. Isn’t that at least a little worrying?

Having your cake and eating it, too

Cavoukian is convinced that big data and privacy can co-exist. Privacy isn’t a zero-sum game, she suggested, arguing instead for a “positive-sum” solution, in which technology users can have innovation and privacy in one go.

The way forward, she believes, involves a concept of her own design. In the 1990s, she coined the term Privacy by Design (PbD), a framework built on seven principles, ranging from the early mitigation of privacy issues when developing IT systems through to the adoption and integration of privacy-enhancing technologies. Organizations can enjoy innovation and privacy alike by adopting the principles of PbD, according to Cavoukian.

“It requires a lot more creativity and innovation, but the end result is so much superior, because you end up getting big data and analytics with privacy embedded into the system,” she said.

De-identification lies at the heart of a PbD approach to big data. This process strips personally identifiable information (PII) out of data sets, leaving analytics systems to concentrate on processing aggregated numbers.

“There are now so many standards and protocols out there to show you how to do it,” she said. “Once you do that, then you’re free to engage in data analytics and connect the data in a variety of ways. The sky’s the limit. That’s what enables privacy and big data, or privacy and IoT. You can, you must be able to do both.”
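To make the idea concrete, here is a minimal sketch of what record-level de-identification can look like in code, assuming a simple dictionary-per-record layout. The field names, the generalization rules and the de_identify helper are purely illustrative; real protocols add far more rigour, such as k-anonymity checks and re-identification risk measurement.

```python
# A minimal, hypothetical sketch of record-level de-identification.
# Direct identifiers are dropped entirely; quasi-identifiers such as age
# and postal code are generalized so aggregate analytics still work.

DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def de_identify(record: dict) -> dict:
    """Return a copy of the record with direct PII removed and quasi-identifiers generalized."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Exact age becomes a ten-year band.
    if "age" in cleaned:
        band = (cleaned["age"] // 10) * 10
        cleaned["age"] = f"{band}-{band + 9}"

    # Postal code is truncated to its first three characters.
    if "postal_code" in cleaned:
        cleaned["postal_code"] = cleaned["postal_code"][:3]

    return cleaned

patient = {"name": "Jane Doe", "email": "jane@example.com", "phone": "555-0142",
           "age": 47, "postal_code": "M5V 2T6", "diagnosis": "asthma"}
print(de_identify(patient))   # {'age': '40-49', 'postal_code': 'M5V', 'diagnosis': 'asthma'}
```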

Flawed algorithms

De-identification has had some bad press over the years, though. In several cases, researchers have been able to cross-reference large data sets to re-identify personal information that had been stripped out.

In 2006, one of the most famous cases embarrassed Netflix. The firm had released the movie-watching histories of 500,000 customers in anonymized form, as part of a competition to improve its recommendation engine. University of Texas researchers homed in on the customers’ identities using statistical techniques, and Netflix later cancelled a planned follow-up competition.

Just today, Latanya Sweeney, the same Harvard professor who famously re-identified the medical records of Massachusetts Governor William Weld, revealed how she used news stories to find the identities of patients in anonymized patient data. Researchers have also been able to re-identify individuals who provided genetic material, and to pinpoint driver information from poorly-anonymized taxicab logs.

That’s the point, argued Cavoukian: they’re poorly anonymized. “In each of those, a dozen cases, no more, the protocol that was used to de-identify the data was weak,” she said, drawing a comparison with weak vs strong encryption. “So I completely reject that premise. It’s nonsense. It’s only easy to re-identify the data if you haven’t done a good job of de-identifying it at the source.”

For some, the jury is still out. In his paper, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Paul Ohm, former senior policy advisor to the Federal Trade Commission and professor of law at Georgetown University Law Center, suggested that privacy and utility are fundamentally opposed in the context of data science.

“So long as data is useful, even in the slightest, then it is also potentially reidentifiable,” he said. “Moreover, for many leading release-and-forget techniques, the tradeoff is not proportional: As the utility of data increases even a little, the privacy plummets.”

Jane Yakowitz, associate professor of law at the University of Arizona’s James E. Rogers College of Law, takes the opposing view.

“The risks imposed on data subjects by datasets that do go through adequate anonymization procedures are trivially small,” she said in her paper, Tragedy of the Data Commons, citing the low probability of an adversary actually existing and the identification risk already posed by other sources.

As our understanding of the techniques evolves, new concepts come to light, such as differential privacy, which aims to maximize the accuracy of data queries while minimizing the chance of re-identification. Scientists are also making progress with homomorphic encryption, which allows results to be computed from encrypted data without decrypting it.
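A rough sketch of the differential privacy idea is the Laplace mechanism applied to a counting query, shown below. The dataset, the predicate, the private_count helper and the epsilon value are all illustrative assumptions; production systems additionally track a privacy budget across many queries.

```python
import random

# A counting query has sensitivity 1 (adding or removing one person changes
# the count by at most 1), so adding Laplace noise with scale 1/epsilon
# yields an epsilon-differentially-private answer.

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale), drawn as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float = 0.5) -> float:
    """Noisy answer to 'how many records satisfy the predicate?'."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 34, 45, 56, 67, 38, 41, 29]
print(private_count(ages, lambda age: age >= 40))   # e.g. 4.8 rather than exactly 4
```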

One variant, called ‘somewhat homomorphic encryption’, promises new opportunities here, Cavoukian said.

“The things that prevented people from using this in the past, which was that it was cumbersome and required a lot of computing power, have been addressed through somewhat homomorphic encryption, which preserves the value of homomorphic encryption but speeds it up considerably,” she said. This is already in use in some projects such as MIT’s Enigma secure cloud initiative.
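The somewhat homomorphic schemes Cavoukian refers to are too involved to reproduce here, but the underlying idea of computing on ciphertexts can be illustrated with the older, additively homomorphic Paillier scheme. The sketch below is an assumption-laden toy, not the scheme she describes: the tiny hard-coded primes make it runnable in a line or two but offer no real security.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic only (the "somewhat
# homomorphic" schemes discussed above also allow limited multiplication).
# Tiny primes for illustration; real keys use primes of ~1024 bits or more.

p, q = 293, 433
n, n_sq = p * q, (p * q) ** 2
g = n + 1                                            # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
mu = pow(lam, -1, n)                                 # inverse of L(g^lam mod n^2) = lam

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                       # blinding factor must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    L = (pow(c, lam, n_sq) - 1) // n
    return (L * mu) % n

# The homomorphic property: multiplying ciphertexts adds the plaintexts.
a, b = encrypt(15), encrypt(27)
print(decrypt((a * b) % n_sq))                       # 42, computed without decrypting a or b
```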

One eye on the future, the other on the past

One way or another, big data and privacy must co-exist in the future. When they do, how will we deal with the past?

History is littered with poorly-designed systems, but one of the promises of big data and the Internet of Things is that we get a do-over. We build new systems that work together in new ways, and because we’re designing from the ground up, we can learn from our mistakes and build them properly.

That’s the dream. Even if we’re that well-organized, the world is full of legacy systems with privacy transgressions already baked in. What of those?

“Legacy systems are fraught with problems, not just in terms of privacy, but in terms of all the security-related issues, and the procedural issues. Over time, they’re going to have to be redesigned anyway. They’re going to have to be upgraded,” Cavoukian said.

That old tin won’t be going away quickly, though, and while it’s here, we’re going to have to cope with it. In 2011, working with W. P. Carey School of Business associate professor Marilyn Prosch, Cavoukian conceived Privacy by Redesign, an attempt to articulate how those legacy systems can be retrofitted with privacy.

What does this look like in practice? It may require that less data be collected, and may see some databases slimmed down, the document said. “It could involve building in additional code that would embed de-identification processes into data that is to be reused for secondary purposes,” Cavoukian added.

There’s no doubt that we’re at a pivotal point in the privacy discussion. With new analytical models and rapid innovations in the kinds of devices connecting to the Internet, the stakes are higher than ever. If organizations don’t get privacy right as they design and implement IoT and big data technologies, we may find ourselves struggling to staunch the flow of data after the fact.

Interested in finding out more? Register for SecTor, which takes place at the Metro Toronto Convention Centre in downtown Toronto on October 20-21, with a training day on October 19.