How GDPR Affects Data Science & Its Future Development
GDPR (General Data Protection Regulation) is a regulation of the European Union (EU) that became enforceable in May 2018. It's having a big effect on the way technology companies use their users' data.
The main theme of this regulation is to let individuals stop companies from holding and processing their personal data, such as name, email, address, GPS location, and medical data, among many other things, without their permission. The law forbids companies from holding people’s data without their knowledge and requires that consent be given explicitly and be as easy to withdraw as to give, so that users can choose what happens to their data.
Let's look at why such a wide-ranging data protection law was passed in the first place.
The Quest for Big Data
Over the past decade, internet use has exploded. It started in the developed world and then spread like wildfire, even to poorer countries. In countries like India, even families living on $200-300 per month can sometimes afford a smartphone and access services like Facebook and Google. With this wide adoption of networked computers came a tsunami of data, and not just any data: personal data of all kinds.
If we go back a few generations, letters and telephones were the best ways to communicate. Now, our phones ping every few minutes with messages that may have been sent from halfway across the planet. One reason the internet has been so widely adopted is free services like Google Search and Facebook. But obviously, those companies do not offer free services out of charity; they make their profits mainly through advertisements.
This means that the more advertisements a company can sell, the more it profits. Google has the biggest online ad network in the world, and why is it so successful? Because it shows the most relevant ads to the right people. Online ads are not like conventional billboards or newspaper ads; they are served to millions of people in real time and can be personalized for each person. So how do the most successful ad companies build such good ad-serving systems? Data science.
Google is one of the most innovative companies in the world because it funds brilliant people from diverse backgrounds to research and invent new techniques. Years before the internet became as ubiquitous as it is now, Google was one of the few companies recording and storing search data in order to analyze user behavior and make its products better.
How Digital Ads Work
Google pioneered systems that could understand users' needs better and return better search results. At the same time, that data was being used to understand user behavior: how users' preferences affect the ads they click on, and whether they go on to buy the advertised products.
If Google showed baby clothing ads at random to people with no babies, it would not have become the top ad seller. Relevance is key. To maximize the chance of you clicking through and then purchasing a product, Google shows you relevant ads by using your personal information in any way it can. The simplest example is your IP address, which can usually be mapped to the country, and often the city, you are connecting from.
If you visit YouTube from the US, you will mostly see ads for American brands, but if you connect through a VPN server in India, you will be served Indian ads. The second most obvious signal is the search term you enter; if you search for car pictures, you will probably see ads for cars and similar topics. But how can the ads become even better? By predicting exactly who needs a product and then showing the ads to that group of people. That is only possible if Google has a huge amount of personal data from all sorts of people.
Data Tracking in Most Online Businesses
Why do you think Gmail is free? Your email contains some very personal details about you. How about YouTube? It is an even richer source of information: the videos you watch can reveal more about you than you realize. Why does Facebook put “like” buttons on third-party web pages? So that it can track you across all those sites and know when you read that article. The same goes for Google Analytics: it is free to site owners so that Google can know what people are up to, even when they are not on Google-owned sites. This is how a huge stream of data ends up being analyzed with data science to build excellent ad-serving systems that maximize ad sales.
The vast majority of the models used by big tech companies are statistical models that are “trained” on data, meaning they learn patterns from it. The general rule for many models is that the more data you give them, the better they predict. Hence, big tech companies with their millions or billions of data points can build the best systems, because no one else has access to as much personal data as they do.
Storing such huge amounts of data for a long time can have massive consequences, such as hacked databases and interference in national elections, as in the Facebook-Cambridge Analytica scandal. The personal data trove isn’t just about advertisements anymore.
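As a rough illustration of why more data helps, consider estimating an ad's click-through rate from n impressions. The sketch below uses the textbook standard error of a proportion, sqrt(p(1 - p) / n). The 2% click rate and the sample sizes are made-up values for illustration, not figures from any real ad network.

```python
import math


def ctr_standard_error(p: float, n: int) -> float:
    """Standard error of a click-through rate estimated from n impressions."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical true click-through rate of 2% (an assumption for this sketch).
TRUE_CTR = 0.02

for n in (1_000, 100_000, 10_000_000):
    se = ctr_standard_error(TRUE_CTR, n)
    print(f"{n:>10} impressions -> standard error about {se:.5f}")
```

Every hundredfold increase in data cuts the uncertainty by a factor of ten, which is one reason companies with billions of data points have an edge.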
GDPR Compliant Data Science
After GDPR, data scientists at commercial companies have to be more careful about what personal data they store and process. In fact, data scientists should ideally consult the company's legal experts and get their approval before implementing a data science project.
There are many sides to GDPR compliance, such as data protection, consent, and the explainability of automatic decision-making.
Consent of Data
When a company relies on consent to process personal data, it must state clearly what it is asking for before the user approves. In the past, terms and conditions were often evasive and unclear about what user data a company processes and how. Now, you need to seek users' permission before you collect data about them. In practice, though, big sites like Google and Facebook don’t really give you separate opt-out tick boxes on the settings page; instead, they give you an option to delete your account and all your data with it. You get little granular control over what happens: either you are in, or you are out.
Ideally, there should be options in the settings that allow the user to control how each data point gets used. So, if you use the data in multiple ways, e.g. you work at an eCommerce site and use customers' addresses not just for deliveries but also for marketing analytics and sales prediction models, then you need to state that explicitly. And you should probably have an opt-out button in the settings too.
Legal actions under GDPR related to cookie consent have already been brought against big tech companies such as Oracle and Salesforce.
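One way to picture per-purpose control is a consent record that is checked before data is used for anything. This is only a minimal sketch under stated assumptions: the class, the function, and the purpose names ("delivery", "marketing_analytics") are hypothetical, and a real system would also need persistence, audit trails, and legal review.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    """Per-user consent, tracked separately for each processing purpose."""
    user_id: str
    # Purposes the user has explicitly opted in to (hypothetical names).
    granted: set = field(default_factory=set)

    def grant(self, purpose: str) -> None:
        self.granted.add(purpose)

    def withdraw(self, purpose: str) -> None:
        self.granted.discard(purpose)

    def allows(self, purpose: str) -> bool:
        # Anything not explicitly granted is treated as refused.
        return purpose in self.granted


def addresses_for(purpose: str, users: dict, consents: dict) -> list:
    """Return only the addresses whose owners consented to this purpose."""
    return [
        address
        for user_id, address in users.items()
        if user_id in consents and consents[user_id].allows(purpose)
    ]


# Made-up users and consent choices for illustration.
users = {"u1": "12 Example Street", "u2": "34 Sample Road"}
consents = {
    "u1": ConsentRecord("u1", {"delivery", "marketing_analytics"}),
    "u2": ConsentRecord("u2", {"delivery"}),
}

print(addresses_for("marketing_analytics", users, consents))
consents["u1"].withdraw("marketing_analytics")  # the opt-out button
print(addresses_for("marketing_analytics", users, consents))
```

The design choice here is that the analytics code never sees an address unless the matching purpose is on record, so an opt-out takes effect the next time the data is requested.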
Automatic Decision Making and Explainability
Statistical and machine learning models are widely used nowadays, especially by big tech companies. How does Facebook recognize faces in your photos? With a deep learning model that was trained on millions of faces. How does Uber match customers to drivers? With a probabilistic model of some kind. But one big limitation of many machine learning models is that it is hard to explain their outputs in terms of the input data.
Most machine learning models “train” on data to learn from it and can then make predictions on data they have never seen before, which is what happens when you use a machine learning model in the real world. Now, you wouldn’t ask many questions if Facebook’s facial recognition algorithm showed a wrong label, or if Gmail’s spam filter missed a spam email.
But what happens when a machine learning algorithm takes your financial records and credit history as input and outputs whether you will receive a loan or not? You will definitely want to know the exact reason behind a declined loan application. Or take, for instance, the online job application process; it has become so easy to sit at home and apply for thousands of jobs all across the country or the world.
As a result, recruiters receive thousands or tens of thousands of applications for any opening, and it is not possible to sort through them by reading them one by one. Natural language processing is therefore often applied to rank these applications, and the rankings might not be very accurate. Hence, your application might get rejected automatically without the recruiter ever glancing at it, and you might not receive any explanation as to why you didn’t make the cut.
This is why GDPR (Article 22) restricts decisions based solely on automated processing, including profiling, and is widely read as giving you the right to an explanation when you are declined.
Hence, companies should deploy a model only when its automatic decisions come with an explanation. Deep learning models, for example, are known to be black boxes in many cases and hence should be avoided if explainability cannot be achieved. Explainability is an active research topic, with both corporate research labs and academia working on making machine learning models more transparent in their reasoning.
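To make "an explanation attached to the decision" concrete, here is a minimal sketch that uses a linear (logistic regression) model, where each feature's contribution to the decision can be read off directly. The feature names, weights, and applicant values are invented for illustration; this is one simple way to get an explainable decision, not a claim about how any real lender works.

```python
import math

# Hypothetical weights of an already-trained logistic regression model.
WEIGHTS = {"income_thousands": 0.04, "missed_payments": -0.9, "years_employed": 0.15}
BIAS = -2.0


def score(applicant: dict) -> float:
    """Probability of approval under the linear model above."""
    z = BIAS + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def explain(applicant: dict) -> list:
    """Per-feature contributions to the log-odds, most negative first."""
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    return sorted(contributions.items(), key=lambda item: item[1])


# A made-up applicant.
applicant = {"income_thousands": 30, "missed_payments": 3, "years_employed": 2}

probability = score(applicant)
decision = "approved" if probability >= 0.5 else "declined"
print(f"Decision: {decision} (p = {probability:.2f})")
for name, contribution in explain(applicant):
    print(f"  {name}: {contribution:+.2f}")
```

Because the model is linear in its inputs, the sorted contributions double as the explanation: the first entry is the factor that pulled the decision down the most. Black-box models need separate explanation techniques to produce something comparable.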
Anonymization of Personal Data
GDPR restricts a lot of personal data usage without consent; however, it does allow anonymous, non-identifiable data to be published and used. Anonymization should remove any information that can be traced back to the user. For example, if Fitbit strips off all names and other identifying information and shares the heart rate data of millions of people, that’s fine. But sharing DNA or fingerprint data, or even the GPS coordinates of someone’s home, is not allowed, since with enough work it can be traced back to the individual.
Hence, data scientists must make sure data is truly anonymous before using it or sharing it with others outside the company.
Take, for instance, the case of the Danish taxi service that violated GDPR by failing to anonymize its ride data (GPS coordinates) and holding it for a long time.
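A first step toward anonymizing ride data could look like the sketch below: drop direct identifiers, coarsen the GPS coordinates, and then check how small the smallest group of identical records is (the idea behind k-anonymity). The column names and rides are made up, and passing such a check does not by itself make data anonymous in the GDPR sense; it only flags records that are obviously still unique.

```python
from collections import Counter

DIRECT_IDENTIFIERS = {"name", "phone"}  # hypothetical column names


def deidentify(ride: dict, decimals: int = 2) -> dict:
    """Drop direct identifiers and coarsen GPS coordinates.

    Rounding to 2 decimals gives roughly 1 km resolution in latitude.
    """
    cleaned = {k: v for k, v in ride.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["lat"] = round(ride["lat"], decimals)
    cleaned["lon"] = round(ride["lon"], decimals)
    return cleaned


def smallest_group(rows: list, quasi_identifiers: tuple) -> int:
    """Size of the smallest group sharing the same quasi-identifier values.

    A dataset is k-anonymous on these columns if this value is at least k.
    """
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(groups.values())


# Made-up rides for illustration.
rides = [
    {"name": "A", "phone": "000-0001", "lat": 55.67612, "lon": 12.56834},
    {"name": "B", "phone": "000-0002", "lat": 55.67588, "lon": 12.57011},
    {"name": "C", "phone": "000-0003", "lat": 55.70391, "lon": 12.53127},
]

cleaned = [deidentify(ride) for ride in rides]
k = smallest_group(cleaned, ("lat", "lon"))
print(f"Smallest group size: {k}")
```

In this toy data, the third ride is still alone in its grid cell after rounding, so it would need further coarsening or suppression before the data could be shared.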
Encrypted Data Science?
Homomorphic encryption is a very promising method that might solve the data protection problem by allowing algorithms to run on user data without actually seeing it. Imagine that you don’t want to share your data directly, so you encrypt it in a certain way that masks it.
Because the encryption is homomorphic, algorithms can still perform operations such as addition and multiplication on the encrypted values, so the company can extract useful statistical insights without having access to the raw data. For the time being, homomorphic encryption is very slow and impractical for many workloads, but future breakthroughs might make the processing of encrypted user data practical. Only time will tell whether this works out or other alternatives emerge.
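To give a feel for the idea, here is a toy version of the Paillier cryptosystem, which is homomorphic for addition only (schemes that also support multiplication are considerably more involved). The primes are far too small to be secure, the function names are this sketch's own, and real projects should use a vetted library rather than code like this. It needs Python 3.8 or later for the modular inverse via pow.

```python
import math
import random


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for the generator g = n + 1
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key, ciphertext: int) -> int:
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


n, private_key = keygen(1009, 1013)

# Users encrypt their heart rates; the server only ever sees ciphertexts.
heart_rates = [72, 65, 80]
ciphertexts = [encrypt(n, rate) for rate in heart_rates]

encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(n, encrypted_total, c)

# Only the key holder can read the aggregate.
print(decrypt(n, private_key, encrypted_total) == sum(heart_rates))
```

The server computes the sum without ever decrypting an individual value, which is the kind of statistical insight the text describes. The total must stay below n, which is why the toy primes limit how much can be summed.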
Areas on which data science developers should focus:
- Local Privacy Laws
- Transparency in algorithms while doing a handshake with other systems
- Simplification of data science module & its workflow
- Sorted data access logs
- Blocking VPN requests for mission-critical data updates (e.g., a payment processor's data collection module)
GDPR has made practical data science harder to some extent, but it is more important to protect the rights of the individual, especially against the severe abuse of personal data. The regulations will change with time and vary by location. It will be interesting to see how data scientists and machine learning practitioners innovate to overcome these challenges.