Strange Loop 2017 lightning talk:
Practical things you can do to protect the privacy of your users, because it's the right thing to do, and because upcoming EU laws do apply to you.
6. Privacy, Engineering
●Minimizing use of personal data
●Protect against insiders
●End-to-end encryption
●Wipeout
○Deleting all of a user's data when they delete their account
●Takeout
○Allowing user to download all their
7. Wipeout Challenges
Where is a user's data?
●Especially hard in schemaless database
●Extra copies (denormalization)
●Scattered over
How to delete it
●A "wipeout pipeline" of deleted accounts
●Each system listens to pipeline
●O(user data) not
10. Thank you. Let's make the world a safer place!
Contact me @eob or <eobrain AT google.com>
References mentioned:
1. leanpub.com/queerprivacy, Sarah Jamie Lewis (ed)
2. Stories from survivors: Privacy & security practices when
coping with intimate partner abuse, Tara Matthews et al
3. General Data Protection Regulation (GDPR)
Editor's Notes
Hi, I'm going to talk about why you as a developer would want to protect the privacy of your users, and what practical engineering things you can do.
The problem with privacy is most users don't ask for it. What they want is the cool features of your app.
However, I think we should think of privacy the way we think of latency, where tail latency is what matters. For privacy protection we should consider the small fraction of people who are especially vulnerable, who would suffer serious harm if their privacy were violated. And that small percentage of people should determine what we do.
But there are also business reasons to protect privacy. Your company can suffer significant reputational damage from a data breach that exposes personal data. Oh, and by the way, guess what the #1 cause of data breaches is? It's employee error. We'll get back to that when we talk about the insider threat.
And finally there's GDPR. Have people heard about that? Well, in a year's time you will have. It's a new European privacy law that takes effect next May. It raises the bar on privacy, and most importantly, if you have any users in Europe, this law probably applies to you, no matter where you are, even if you're a startup here in St. Louis. So, like it or not, you're going to have to up your game in privacy.
Image Credit: https://commons.wikimedia.org/wiki/File:European_Union_in_the_World.svg
So what is privacy anyway? Legally, privacy is pretty complex, and you probably need a lawyer to figure out
But I'm an engineer, so let's talk engineering, and some of the practical things you can do.
First of all, you should apply the YAGNI principle to personal data: if you're not sure you're going to need a piece of personal data, then don't store it. If you're not storing it, you don't have to worry about protecting it.
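One way to picture YAGNI for personal data is an allowlist applied before anything is written to storage. This is only a minimal sketch: the field names and the `minimize` helper are hypothetical, invented for illustration, and are not from the talk.

```python
from typing import Any, Dict

# Hypothetical allowlist: the only fields this imaginary app has a concrete use for.
NEEDED_FIELDS = frozenset({"user_id", "display_name"})


def minimize(signup_form: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted fields; everything else is never stored."""
    return {key: value for key, value in signup_form.items() if key in NEEDED_FIELDS}


# Example submission containing more personal data than the app needs.
submitted = {
    "user_id": "u1",
    "display_name": "Al",
    "birth_date": "1990-01-01",
    "phone": "555-0100",
}
stored = minimize(submitted)  # birth_date and phone are dropped here
```

An allowlist, rather than a blocklist, means that a new form field is dropped by default until someone decides it is actually needed.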
And a word on the insider threat. You may trust your colleagues, but you should still lock down insiders' access to user data, for example using access controls and audit logging from your cloud provider. Remember, we saw earlier that the #1 cause of data breaches is employee error, so it is good to be protected against yourself. This will also mitigate the harm from an attacker who steals your or your colleagues' credentials to impersonate you.
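In practice you would rely on your cloud provider's access controls and audit logs rather than write your own. Purely to illustrate the two ideas together, here is a toy in-process sketch. The `AuditedUserStore` class and the staff identifiers are hypothetical and do not correspond to any real provider's API.

```python
from typing import Dict, List, Set, Tuple


class AuditedUserStore:
    """Hypothetical wrapper: every staff read is access-checked and logged."""

    def __init__(self, records: Dict[str, dict], allowed_staff: Set[str]) -> None:
        self._records = records
        self._allowed_staff = allowed_staff
        # (staff_id, user_id, outcome) tuples. A real audit log would be
        # append-only and kept out of reach of the people it audits.
        self.audit_log: List[Tuple[str, str, str]] = []

    def read(self, staff_id: str, user_id: str) -> dict:
        allowed = staff_id in self._allowed_staff
        # Log before acting, so that denied attempts are recorded too.
        self.audit_log.append((staff_id, user_id, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{staff_id} may not read user data")
        return self._records[user_id]


store = AuditedUserStore({"u1": {"email": "alice@example.com"}}, allowed_staff={"oncall"})
store.read("oncall", "u1")
try:
    store.read("intern", "u1")
except PermissionError:
    pass  # denied, but the attempt is still in the audit log
```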
But let's concentrate on one particular thing we can do: Wipeout. This is giving users the ability to delete all of their data that you are storing, typically when they delete their account.
There are two hard problems.
One is identifying where the user data is. If it were all in a normalized SQL database, this would be easy: just use the schema to prepare a DELETE request. But in a schemaless database, where client code can write data anywhere, it is harder. You also have to consider that there may be multiple copies of some data, or that data may be scattered over multiple backend systems: databases, blob stores, caches, pub-sub queues, etc.
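To make the "easy" normalized case concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` and `posts` tables and the `wipe_user` helper are hypothetical examples. The point is that when the schema declares which rows belong to a user, a single DELETE can reach all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        body TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "alice@example.com"), (2, "bob@example.com")],
)
conn.executemany(
    "INSERT INTO posts (user_id, body) VALUES (?, ?)",
    [(1, "hello"), (1, "again"), (2, "hi from bob")],
)
conn.commit()


def wipe_user(db: sqlite3.Connection, user_id: int) -> None:
    """Delete the user row; ON DELETE CASCADE removes the dependent rows."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))


wipe_user(conn, 1)
remaining_posts = conn.execute("SELECT user_id, body FROM posts ORDER BY id").fetchall()
remaining_users = conn.execute("SELECT id FROM users ORDER BY id").fetchall()
```

A schemaless store has no such declared relationship to lean on, and that gap is the source of the difficulty.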
Then there is the problem of how to do the deletion. One approach is to have a wipeout pipeline where all deletion requests go, and then have each system that might contain user data listen to it, so that each request triggers that system to delete the user's data. One note on efficiency: for scalability, we have to make sure that the deletion time is on the order of the size of the user's data, not the size of the database. We don't want to be scanning through the entire database.
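The pipeline idea can be sketched in a few lines. This is an in-process stand-in for what would really be a pub-sub topic, and the `WipeoutPipeline` and `UserIndexedStore` names are hypothetical. Each toy backend keeps a per-user index of its keys, which is one possible way (an assumption here, not necessarily the talk's) to keep deletion proportional to the user's data rather than to the whole database.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class WipeoutPipeline:
    """Hypothetical in-process stand-in for a pub-sub topic of deleted accounts."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def publish_deleted_account(self, user_id: str) -> None:
        # Every subscribed system hears about every deleted account.
        for listener in self._listeners:
            listener(user_id)


class UserIndexedStore:
    """Toy backend that remembers which keys belong to which user."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}
        self._keys_by_user: Dict[str, Set[str]] = defaultdict(set)

    def put(self, user_id: str, key: str, value: str) -> None:
        self.records[key] = value
        self._keys_by_user[user_id].add(key)

    def wipe_user(self, user_id: str) -> None:
        # Visits only this user's keys: O(user data), not O(database).
        for key in self._keys_by_user.pop(user_id, set()):
            self.records.pop(key, None)


pipeline = WipeoutPipeline()
profiles = UserIndexedStore()
photos = UserIndexedStore()
pipeline.subscribe(profiles.wipe_user)
pipeline.subscribe(photos.wipe_user)

profiles.put("alice", "profile/alice", "Alice A.")
profiles.put("bob", "profile/bob", "Bob B.")
photos.put("alice", "photo/1", "beach.jpg")

pipeline.publish_deleted_account("alice")
```

A real pipeline would also need durable delivery and retries, so that a backend that is down when the account is deleted still performs its wipeout later.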
So we did an experiment to create an open-source library that could generically do wipeout on the type of system you see here, which is a serverless system that has an authenticated client device talking directly to a database. Some authorization rules make sure that a particular user only has access to the parts of the database they are allowed to