Privacy, Security and Ethics in Data Science
Nikolaos Vasiloglou
Summary
● From Public to Private datasets
● Anonymizing data
● Anonymizing computations
● Seeking security for my data
● The unethical surprise of a data scientist
● Data science, the opportunity to build a more equal world
When are public data useful to a data scientist?
● Public data are anonymized by default (census data)
● By their nature, some raise no privacy concern (ImageNet)
● Public data may come with an identifier that allows you to join them with private data (census); see the sketch below
● Public data can also be joined semantically, without an identifier (ImageNet)
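To make the identifier-based join concrete, here is a minimal sketch (all column names and values are hypothetical, not from real census data):

```python
# A minimal sketch: enrich a private table with public census data via a
# shared identifier. Column names and values here are hypothetical.
import pandas as pd

private = pd.DataFrame({"zip": ["30301", "30302"], "spend": [120, 95]})
census = pd.DataFrame({"zip": ["30301", "30302"], "median_income": [52000, 61000]})

# The shared public identifier ("zip") is what makes the datasets joinable.
enriched = private.merge(census, on="zip", how="left")
print(enriched)
```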
Public datasets that are not so useful
● Netflix, MovieLens, and other recommender datasets
● There is no way you can join them with real users
● They are only good for testing your algorithm
● Get an expectation of the accuracy range
Not-so-public, not-so-private data
● Twitter data
● Facebook data
● Yelp data
● Amazon data
What is wrong with these data?
● Some of them are visible to your friends but not to everybody
● Even when they are public, they might not be personally identifiable
● The fact that they are there does not mean you can use them without consent
● The Cambridge Analytica case
How to respect people’s privacy
● The minimum you can do is store the data in a safe place
● Is the cloud safe?
● How safe can it be?
● Is encryption enough?
Major Failures
How many layers of protection should I add?
● Two factor authentication
● VPN
● Encryption
● ….
● What is wrong with it?
● More secure -> more difficult to use -> people become creative in exfiltrating the data
Is my laptop safer than my company’s servers?
● Let’s discuss it!
● Who is the best target?
Use the usual trick
● Distribute your data
● Why is this helpful?
An example: Distributed Addition
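A minimal sketch of one way to do this, additive secret sharing (my illustration, not necessarily the exact scheme on the slide): each party splits its private value into random shares that sum to the value modulo a prime, so no single share reveals anything about the input.

```python
import random

P = 2**61 - 1  # a large prime modulus (illustrative choice)

def share(value, n_parties):
    # Split a secret into random shares that sum to it modulo P.
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

inputs = [10, 20, 30]                      # each party's private value
all_shares = [share(v, 3) for v in inputs]

# Party i publishes only the sum of the i-th shares, never a raw input.
partials = [sum(s[i] for s in all_shares) % P for i in range(3)]
assert sum(partials) % P == sum(inputs)    # 60 recovered, inputs hidden
```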
Another example: Distributed multiplication
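By analogy, a sketch using multiplicative secret sharing over a prime field (again my own illustration): shares multiply to the secret, so the parties can compute a product without revealing any individual factor.

```python
import math
import random

P = 2**61 - 1  # a large prime modulus (illustrative choice)

def mult_share(value, n_parties):
    # Random nonzero shares whose product mod P equals the secret.
    shares = [random.randrange(1, P) for _ in range(n_parties - 1)]
    last = (value * pow(math.prod(shares) % P, -1, P)) % P
    return shares + [last]

inputs = [3, 4, 5]                         # each party's private (nonzero) value
all_shares = [mult_share(v, 3) for v in inputs]

# Party i publishes only the product of the i-th shares.
partials = [math.prod(s[i] for s in all_shares) % P for i in range(3)]
assert math.prod(partials) % P == 60       # 3 * 4 * 5, factors hidden
```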
Homomorphic encryption: another direction
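Homomorphic encryption lets a third party compute on ciphertexts without ever decrypting them. Below is a toy sketch of the additively homomorphic Paillier scheme (my illustration: multiplying two ciphertexts adds the underlying plaintexts); the key sizes here are deliberately tiny, and real deployments need large primes and a vetted library.

```python
import math
import random

def L(x, n):
    return (x - 1) // n

def keygen(p=1789, q=1861):      # toy primes; real keys need >= 1024 bits
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)   # randomizer must be coprime to n
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return L(pow(c, lam, n * n), n) * mu % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 42), encrypt(pub, 58)
# The homomorphic property: ciphertext product decrypts to plaintext sum.
assert decrypt(priv, c1 * c2 % pub[0] ** 2) == 100
```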
Does encryption really respect privacy?
● What if I train a classifier and then throw away the data?
● Can the classifier leak the data it was trained on?
Adversarial Attacks
● Reconstructing training datasets
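As a deliberately simplified, hypothetical illustration: the most basic such leak is membership inference, where an overfit model assigns systematically higher confidence to examples it was trained on; reconstruction attacks push the same signal much further.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_in, X_out = rng.normal(size=(50, 20)), rng.normal(size=(50, 20))
y_in, y_out = (X_in[:, 0] > 0).astype(int), (X_out[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_in, y_in)  # small data, overfit-prone

def true_label_confidence(X, y):
    # Probability the model assigns to each example's true label.
    return model.predict_proba(X)[np.arange(len(y)), y]

# Training members tend to score higher than non-members; thresholding
# this gap is the simplest membership-inference attack.
print(true_label_confidence(X_in, y_in).mean(),
      true_label_confidence(X_out, y_out).mean())
```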
Reconstructing images
Reconstructing text
Differential Privacy: the remedy
What is differential privacy?
● Practically speaking, a clever way to add noise to your model without hurting performance (see the sketch below)
● More or less the same trick as gradient descent
● Bayesian models give you differential privacy for free!
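A minimal sketch of the canonical construction, the Laplace mechanism (my illustration, with made-up data): a counting query has sensitivity 1, since one person can change the count by at most 1, so Laplace noise with scale 1/ε makes the answer ε-differentially private.

```python
import numpy as np

def private_count(data, predicate, epsilon):
    # A counting query has sensitivity 1, so Laplace noise with scale
    # 1/epsilon yields an epsilon-differentially-private answer.
    true_count = sum(predicate(x) for x in data)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 47]            # hypothetical sensitive data
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```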
What if you are not allowed to see the data?
● Sensitive and personal data like email
● The European Union restricts storing Personally Identifiable Information
● What representation should you use?
Generative Models
● GANs
● LSTMs
Are they safe?
● Almost!
● It is possible for them to leak information
● You have to be careful
What if we redact the sensitive information?
● Netflix Prize 2
Identifying people from location data
Life after the model: ethical responsibility
Some Facts about ML algorithms
● Garbage In -> Garbage Out
● Racism In -> Racism Out
A never-ending list of failures
Social Bias
More on social bias
More sad failures
Debiasing is possible
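One concrete technique is hard debiasing of word embeddings (a sketch of the idea from Bolukbasi et al., 2016, with hypothetical toy vectors): project every word vector off an estimated bias direction, such as vec("he") - vec("she").

```python
import numpy as np

def debias(vectors, bias_direction):
    # Remove each vector's component along the unit-normalized bias axis.
    g = bias_direction / np.linalg.norm(bias_direction)
    return vectors - np.outer(vectors @ g, g)

# Toy stand-ins; in practice the direction comes from real embeddings.
rng = np.random.default_rng(0)
words, direction = rng.normal(size=(4, 50)), rng.normal(size=50)
clean = debias(words, direction)
assert np.allclose(clean @ (direction / np.linalg.norm(direction)), 0)
```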
You can fix it
Use legitimate sources of information
NIPS devoted a keynote and a workshop to this
Conclusion
