Data Science and Privacy. How much is it okay to know?

It’s often said that technology outpaces regulation, but when it comes to data privacy, regulation is quickly catching up.

RMIT Online | 4 min read | Updated on 3 June 2020

When Big Data started gathering commercial momentum back in 2010, the concept of ‘data privacy’ was a quaint throwback. A relic of an earlier, more innocent internet. In 2010, people assumed that an interconnected, open-sourced digital world was a good thing. Something to strive for. The more data you harvested, the better your decision-making, the more fine-tuned your marketing strategy, the more efficient your business. If people were careless with their personal data, well, that was unfortunate, but it was hardly industry’s responsibility.

Fast-forward to 2019, and the data privacy landscape looks very, very different. Today, data privacy, security and governance are probably the biggest challenges facing data scientists, especially as automation and machine learning surge into the mainstream. Privacy isn’t an ideal or some annoying commercial obstacle anymore – it’s a legal obligation.

It’s raised one very important question: how much is it okay to know?

The new world of data regulation

The shift towards data regulation came swiftly. More than one hundred countries have passed dedicated data privacy legislation over the last few years. The most significant reforms – which pretty much set the tone and standard for global data collection – were the European General Data Protection Regulation (GDPR), which took effect in 2018, and the California Consumer Privacy Act, which was passed the same year.

Many of these regulations were a direct result of 2017, which has gone down as one of the shakiest data security years on record. American credit monitoring company Equifax suffered a data breach that affected 143 million people, leaking names, birth dates, Social Security Numbers and around 200,000 credit card numbers. Before that there was the Verizon scandal, where the personal data of 14 million customers (including PINs) was left sitting in an unprotected Amazon S3 storage server. In the same year, Uber admitted that hackers had stolen personal information from 57 million riders and drivers in a 2016 breach (the company paid the criminals $100,000 to keep it quiet).

Data breaches were becoming more common, and the average cost per breach was around $3.6 million. Governments basically decided that enough was enough.

Data science and privacy

There’s a tension inside data science between transparency and protection. Without data collection and the free movement of information, data science wouldn’t exist, but the more information you gather, the more complicated protection becomes.

Data scientists need to figure out how to make sure data is both secure and accessible (for when lawful disclosure is required). They need to make data shareable, but also allow people to retract information, if necessary. When working on big projects across multiple countries, they also have to comply with overlapping privacy regulations.

There are a few ways to tackle these challenges. Data anonymization is one: de-identifying collected data and separating the information from the real people to whom it applies. Many data privacy regulations, including the GDPR, give organisations a strong incentive to do this, because data that has been truly anonymized is no longer treated as personal data. From a business standpoint, this isn’t ideal – true anonymization isn’t reversible, and if you scrub all personal information from collected data, it’s pretty hard to actually use it for anything. Data generalisation is an alternative, where companies ‘clump’ data into broad categories (age groups, geographical areas, etc.), while making sure the data can’t be converted back into its granular form.
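To make the idea of generalisation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a compliance recipe: the records, the field names (name, email, age, postcode, spend) and the helper functions are all hypothetical, and a real project would need to decide which fields count as identifying.

```python
# Hypothetical customer records; every name and value is made up for illustration.
RECORDS = [
    {"name": "Ana Lee", "email": "ana@example.com", "age": 34, "postcode": "3053", "spend": 120.0},
    {"name": "Ben Ortiz", "email": "ben@example.com", "age": 37, "postcode": "3051", "spend": 80.5},
    {"name": "Cara Wu", "email": "cara@example.com", "age": 62, "postcode": "3000", "spend": 45.0},
]

# Assumption: these fields identify a person directly and are simply dropped.
DIRECT_IDENTIFIERS = {"name", "email"}


def age_band(age, width=10):
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalise(record):
    """Drop direct identifiers and coarsen age and postcode."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["age"] = age_band(record["age"])
    # Keep only the first two digits of the postcode (a broad area).
    out["postcode"] = record["postcode"][:2] + "**"
    return out


generalised = [generalise(r) for r in RECORDS]

for row in generalised:
    print(row)
```

The trade-off described above shows up directly: the generalised rows can still support questions like "how much do customers in their thirties spend?", but the exact age and postcode cannot be recovered from them.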

These measures are becoming more standard across various industries, but they’re not 100 per cent effective…

AI and data privacy

“Seemingly anonymized personal data can easily be de-anonymized by AI,” says Bernhard Debatin, director of the Institute for Applied and Professional Ethics. “It also allows for tracking, monitoring, and profiling people as well as predicting behaviours. Together with facial recognition technology, such AI systems can be used to cast a wide network of surveillance. All these issues raise urgent concerns about privacy.”
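You don't even need AI to see why this happens. The sketch below uses a plain lookup rather than a machine learning model, which is a much simpler technique than the one Debatin describes, but it shows the underlying weakness: if a "de-identified" dataset still carries details such as birth year, postcode and gender, it can be matched against another dataset that includes names. All of the data, field names and the link function here are hypothetical.

```python
from collections import defaultdict

# "Anonymized" release: names removed, other details left intact (hypothetical data).
RELEASED = [
    {"birth_year": 1985, "postcode": "3053", "gender": "F", "diagnosis": "asthma"},
    {"birth_year": 1985, "postcode": "3053", "gender": "M", "diagnosis": "diabetes"},
    {"birth_year": 1990, "postcode": "3000", "gender": "F", "diagnosis": "migraine"},
    {"birth_year": 1990, "postcode": "3000", "gender": "F", "diagnosis": "eczema"},
]

# A separate, public dataset that does include names (hypothetical data).
PUBLIC = [
    {"name": "Ana Lee", "birth_year": 1985, "postcode": "3053", "gender": "F"},
    {"name": "Ben Ortiz", "birth_year": 1985, "postcode": "3053", "gender": "M"},
    {"name": "Cara Wu", "birth_year": 1990, "postcode": "3000", "gender": "F"},
]

# Assumption: these fields appear in both datasets and act as quasi-identifiers.
QUASI_IDENTIFIERS = ("birth_year", "postcode", "gender")


def key(record):
    """Build a comparable key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released, public):
    """Return {name: diagnosis} for people who match exactly one released row."""
    groups = defaultdict(list)
    for row in released:
        groups[key(row)].append(row)

    matches = {}
    for person in public:
        candidates = groups.get(key(person), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(RELEASED, PUBLIC)
print(reidentified)
```

In this toy data, two of the three named people match exactly one released row and are re-identified. The third shares her details with another record, so the lookup can't tell which diagnosis is hers. That is the intuition behind generalising data until every combination of details is shared by several people.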

As machine learning models become more ubiquitous in the manipulation of Big Data, they create a whole set of thorny, privacy-related questions. How do we monitor our algorithms? Is it possible to accurately predict what these models will do? (The so-called “Black Box Problem” refers to non-guided neural networks that, essentially, teach themselves, making realistic regulation tricky, if not flat-out impossible.)

Regulations like the GDPR have tried to lay down some AI ground rules when it comes to data collection and processing. AI systems have to be transparent, they must have a “deeply rooted” right to the information they’re collecting, consumers have to be able to opt out of the system, the purpose of the AI must be limited by design, and data must be deleted upon consumer request. Of course, policing these policies and getting global corporations to comply is another matter.

The future of data privacy

There’s a debate raging at the moment over whether AI is an enemy or an ally when it comes to data privacy. On the one hand, ‘ethical AI’ models might make data collection more compliant, and AI algorithms are certainly fighting on the front lines of cyber security. On the other hand, the bad guys have AI too – AI-powered cyber attacks are probably the biggest global threat to data security. And there are more insidious threats: bad data and human prejudice can magnify all sorts of nasty AI-driven bias, which is pretty much the opposite of what data science is trying to achieve.

The future of data privacy is uncertain, but the general direction is pretty clear. It’s unlikely we’ll ever go back to the days of 2010, when security was considered (at best) a nice-to-have. Principles of data privacy are now baked into the legislative framework, and Data Privacy Management roles are increasingly in demand within organisations and across industry.

Data privacy is here to stay, but only time will tell if it’s effective.