SlideShare a Scribd company logo
1 of 36
Download to read offline
Protecting privacy in
practice
2017-06-13
Lars Albertsson
www.mapflat.com
1
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
2
Privacy protection resources
3
All of this might go
wrong. Large fine.
Pour your data into
our product.
404
Privacy-driven design
● Technical scope
○ Toolbox
○ Not complete solutions
● Assuming that you solve:
○ Legal requirements
○ Security primitives
○ ...
● Not description of any company
4
Archi-
tecture
Statis-
tics
Legal
Organi-
sation
Security
Process Privacy UX
Culture
Requirements, engineer’s perspective
● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
○ From employees
○ In case of security breach
● Right for explanations
● User data enumeration
● User data export
5
Ancient data-centric systems
● The monolith
● All data in one place
● Analytics + online serving from
single database
● Current state, mutable
- Please delete me?
- What data have you got on me?
- Sure, no problem!
6
DB
Presentation
Logic
Storage
Event oriented systems
7
Every event All events, ever,
raw, unprocessed
Refinement
pipeline
Artifact of
value
Event oriented systems
8
Every event All events, ever,
raw, unprocessed
Refinement
pipeline
Artifact of
value
● Motivated by
○ New types of data-driven (AI) features
○ Quicker product iterations
■ Data-driven product feedback (A/B tests)
■ Fewer teams involved in changes
○ Robustness - scales to more complex business logic
Enable disruption
v
Service
Data processing at scale
9
Cluster storage
Ingress Offline processing Egress
Data
lake
DB
Service
DatasetJob
Pipeline
Service
Export
Business
intelligence
DB
DB
Import
Workflow manager
10
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Recommended: Luigi / Airflow
DB
Factors of success
11
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
- Please delete me?
- What data have you got on me?
- Hold on a second...
Solution space
12
Technical
feasibility
Easy to do
the right thing
Awareness
culture
Personal information (PII) classification
● Red - sensitive data
○ Messages
○ GPS location
○ Views, preferences
● Yellow - personal data
○ IDs (user, device)
○ Name, email, address
○ IP address
● Green - insensitive data
○ Not related to persons
○ Aggregated numbers
13
● Grey zone
○ Birth date, zip code
○ Recommendation / ads models?
Nota bene: This is only an
example classification
PII arithmetics
Red + green = red red + yellow = red yellow + green = yellow
Aggregate(red/yellow) = green ?
Green + green + green = yellow ?
Yellow + yellow + yellow = red ?
Machine_learning_model(yellow) = yellow ?
Overfitting => persons could be identified
14
Make privacy visible at ground level
● In dataset names
○ hdfs://red/crm/received_messages/year=2017/month=6/day=13
○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13
● In field names
○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …”
● In credential / service / table / ... names
● Spreads awareness
● Catch mistakes in code review
● Enables custom tooling for violation warnings
15
Eye of the needle tool
● Provide data access through gateway tool
○ Thin wrapper around Spark/Hadoop/S3/...
○ Hard-wired configuration
● Governance
○ Access audit, verification
○ Policing/retention for experiment data
16
Eye of the needle tool
● Easy to do the right thing
○ Right resource choice, e.g. “allocate temporary
cluster/storage”
○ Enforce practices, e.g. run jobs from central
repository code
○ No command for data download
● Enabling for data scientists
○ Empowered without operations
○ Directory of resources
17
Towards oblivion
● Left to its own devices,
personal (PII) data
spreads like weed
● PII data needs to be
governed, collared, or
discarded
○ Discard what you can
18
● Discard all PII
○ User id in example
● No link between records or datasets
● Replace with non-PII
○ E.g. age, gender, country
● Still no link
○ Beware: rare combination => not anonymised
Drop user id
Discard: Anonymisation
19
Replace user id with
demographics
Useful for
business
insights
Useful for
metrics
Partial discard: Pseudonymisation
● Hash PII
● Records are linked
○ Across datasets
○ Still PII, GDPR applies
○ Persons can be identified (with additional data)
○ Hash recoverable from PII
● Hash PII + salt
○ Hash not recoverable
● Records are still linked
○ Across datasets if salt is constant
20
Hash user id Useful for
recommendations
Hash user id
+ salt
Useful for product
insights
● Push reruns with
workflow manager
- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
+ No data model changes required
Governance: Recomputation
21
● Fields reference PII table
● Clear single record => oblivion
- PII table injection needed
- Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ Small PII leak risk
Ejected record pattern
22
● Datasets are immutable
● Version n+1 of raw dataset lacks record
● Short retention of old versions
● Always depend on latest version
Record removal in pipelines
23
User keys
2017-06-12
User keys
2017-06-13
class Purchases(Task):
date = DateParameter()
def requires(self):
return [Users(self.date),
Orders(self.date),
UserKeys.latest()]
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Small PII leak risk
Lost key pattern
24
● Different fields encrypted
with different keys
● Partial user oblivion
○ E.g. forget my GPS coordinates
Lost key partial oblivion
25
● Encrypt key fields that link datasets
● Ability to join is lost
● No data loss
○ Salt => anonymous data
○ No salt => pseudonymous data
Lost link key
26
Reversible oblivion
● Lost key pattern
● Give ejected record key to
third party
○ User
○ Trusted organisation
● Destroy company copies
27
Data model deadly sins
● Using PII data as key
○ Username, email
● Publishing entity ids containing PII data
○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
○ They can be de-pseudonymised with external data
○ E.g. AOL, Netflix, ...
28
Tombstone line
● Produce dataset/stream of
forgotten users
● Egress components, e.g. online
service databases, may need
push for removal.
○ Higher PII leak risk
29
DB Service
The art of deletion
● Example: Cassandra
● Deletions == tombstones
● Data remains
○ Until compaction
○ In disconnected nodes
Component-specific expertise necessary
30
Deletion layers
● Every component adds deletion burden
○ Minimise number of components
○ Ephemeral >> dedicated. Recycle machines.
● Every storage layer adds deletion burden
○ Minimise number of storage layers
○ Cloud storage requires documented erasure semantics + agreements.
● Invent simple strategies
○ Example: Cycle Cassandra machines regularly, erase block devices.
Increasing cost of heterogeneity
31
Retention limitation
● Best solved in workflow manager
○ Connect creation and destruction
● Short default retention, whitelist exceptions
● In conflict with technical ideal of immutable raw data
32
Lake promotion
● Remove expire raw dataset, promote derived datasets to lake
● First derived dataset = washed(raw)?
● Workflow DAG still works
33
Data lake Derived
Cluster storage
Data lake Derived
Cluster storage
Lineage
● Tooling for tracking data flow
● Dataset granularity
○ Workflow manager?
● Field granularity
○ Framework instrumentation?
● Multiple use cases
○ (Discovering data)
○ (Pipeline change management)
○ Detecting dead end data flows
○ Right to export data
○ Explanation of model decisions
34
Solicitation: PII & lineage type systems
● Idea: decorate (scala) types
○ PII classification (red/yellow/green)
○ Lineage (e.g. processing class id + commit id + dataset revision)
● Assistance with PII arithmetics
○ PII[Red, String] + PII[Green, String] => PII[Red, String]
○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int]
● Detect unused PII fields
● Assist with recomputation
○ For PII cleaning
○ Bug fixes
35
Resources Credits
● http://www.slideshare.net/lallea/d
ata-pipelines-from-zero-to-solid
● http://www.mapflat.com/lands/res
ources/reading-list
● https://ico.org.uk/
● EU Article 29 Working Party
36
● Alexander Kjeldaas, independent
● Lena Sundin, Spotify
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen,
Sharp Cookie Advisors
● Øyvind Løkling,
Schibsted Media Group

More Related Content

What's hot

What's hot (20)

Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Towards Data Operations
Towards Data OperationsTowards Data Operations
Towards Data Operations
 
Data democratised
Data democratisedData democratised
Data democratised
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Mastro
MastroMastro
Mastro
 
Mastro
MastroMastro
Mastro
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructure
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems research
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Real Time Big Data
Real Time Big DataReal Time Big Data
Real Time Big Data
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 

Viewers also liked

Viewers also liked (7)

Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Tomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal ClubTomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal Club
 
Data Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management InitiativesData Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management Initiatives
 
GDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud ProvidersGDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud Providers
 
EU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketingEU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketing
 
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
 
Revising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPRRevising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPR
 

Similar to Protecting privacy in practice

Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
gustavosouto
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 

Similar to Protecting privacy in practice (20)

Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Privacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatPrivacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, Mapflat
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Role of ML engineer
Role of ML engineerRole of ML engineer
Role of ML engineer
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Data preprocessing.pdf
Data preprocessing.pdfData preprocessing.pdf
Data preprocessing.pdf
 
Druid
DruidDruid
Druid
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
 

More from Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson
 

More from Lars Albertsson (16)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 

Recently uploaded

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 

Recently uploaded (20)

7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 

Protecting privacy in practice

  • 1. Protecting privacy in practice 2017-06-13 Lars Albertsson www.mapflat.com 1
  • 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) 2
  • 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  • 4. Privacy-driven design ● Technical scope ○ Toolbox ○ Not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Not description of any company 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  • 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Right for explanations ● User data enumeration ● User data export 5
  • 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Sure, no problem! 6 DB Presentation Logic Storage
  • 7. Event oriented systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value
  • 8. Event oriented systems 8 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  • 9. v Service Data processing at scale 9 Cluster storage Ingress Offline processing Egress Data lake DB Service DatasetJob Pipeline Service Export Business intelligence DB DB Import
  • 10. Workflow manager 10 ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ● DSL describes DAG ○ Includes ingress & egress Recommended: Luigi / Airflow DB
  • 11. Factors of success 11 ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Hold on a second...
  • 12. Solution space 12 Technical feasibility Easy to do the right thing Awareness culture
  • 13. Personal information (PII) classification ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers 13 ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models? Nota bene: This is only an example classification
  • 14. PII arithmetics Red + green = red red + yellow = red yellow + green = yellow Aggregate(red/yellow) = green ? Green + green + green = yellow ? Yellow + yellow + yellow = red ? Machine_learning_model(yellow) = yellow ? Overfitting => persons could be identified 14
  • 15. Make privacy visible at ground level ● In dataset names ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13 ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13 ● In field names ○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …” ● In credential / service / table / ... names ● Spreads awareness ● Catch mistakes in code review ● Enables custom tooling for violation warnings 15
  • 16. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 16
  • 17. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 17
  • 18. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 18
  • 19. ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised Drop user id Discard: Anonymisation 19 Replace user id with demographics Useful for business insights Useful for metrics
  • 20. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 20 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
  • 21. ● Push reruns with workflow manager - No versioning support in tools - Computationally expensive - Easy to miss datasets + No data model changes required Governance: Recomputation 21
  • 22. ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + Small PII leak risk Ejected record pattern 22
  • 23. ● Datasets are immutable ● Version n+1 of raw dataset lacks record ● Short retention of old versions ● Always depend on latest version Record removal in pipelines 23 User keys 2017-06-12 User keys 2017-06-13 class Purchases(Task): date = DateParameter() def requires(self): return [Users(self.date), Orders(self.date), UserKeys.latest()]
  • 24. ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Decryption (user) id needed + Multi-field oblivion + Small PII leak risk Lost key pattern 24
  • 25. ● Different fields encrypted with different keys ● Partial user oblivion ○ E.g. forget my GPS coordinates Lost key partial oblivion 25
  • 26. ● Encrypt key fields that link datasets ● Ability to join is lost ● No data loss ○ Salt => anonymous data ○ No salt => pseudonymous data Lost link key 26
  • 27. Reversible oblivion ● Lost key pattern ● Give ejected record key to third party ○ User ○ Trusted organisation ● Destroy company copies 27
  • 28. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ... 28
  • 29. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk 29 DB Service
  • 30. The art of deletion ● Example: Cassandra ● Deletions == tombstones ● Data remains ○ Until compaction ○ In disconnected nodes Component-specific expertise necessary 30
  • 31. Deletion layers ● Every component adds deletion burden ○ Minimise number of components ○ Ephemeral >> dedicated. Recycle machines. ● Every storage layer adds deletion burden ○ Minimise number of storage layers ○ Cloud storage requires documented erasure semantics + agreements. ● Invent simple strategies ○ Example: Cycle Cassandra machines regularly, erase block devices. Increasing cost of heterogeneity 31
  • 32. Retention limitation ● Best solved in workflow manager ○ Connect creation and destruction ● Short default retention, whitelist exceptions ● In conflict with technical ideal of immutable raw data 32
  • 33. Lake promotion ● Remove expire raw dataset, promote derived datasets to lake ● First derived dataset = washed(raw)? ● Workflow DAG still works 33 Data lake Derived Cluster storage Data lake Derived Cluster storage
  • 34. Lineage ● Tooling for tracking data flow ● Dataset granularity ○ Workflow manager? ● Field granularity ○ Framework instrumentation? ● Multiple use cases ○ (Discovering data) ○ (Pipeline change management) ○ Detecting dead end data flows ○ Right to export data ○ Explanation of model decisions 34
  • 35. Solicitation: PII & lineage type systems ● Idea: decorate (scala) types ○ PII classification (red/yellow/green) ○ Lineage (e.g. processing class id + commit id + dataset revision) ● Assistance with PII arithmetics ○ PII[Red, String] + PII[Green, String] => PII[Red, String] ○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int] ● Detect unused PII fields ● Assist with recomputation ○ For PII cleaning ○ Bug fixes 35
  • 36. Resources Credits ● http://www.slideshare.net/lallea/d ata-pipelines-from-zero-to-solid ● http://www.mapflat.com/lands/res ources/reading-list ● https://ico.org.uk/ ● EU Article 29 Working Party 36 ● Alexander Kjeldaas, independent ● Lena Sundin, Spotify ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group