Protecting privacy in practice

2017-06-13
Lars Albertsson, Founder & Data Engineer
www.mapflat.com
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
Privacy protection resources
All of this might go wrong. Large fine.
Pour your data into our product.
404
Privacy-driven design
● Technical scope
○ Toolbox
○ Not complete solutions
● Assuming that you solve:
○ Legal requirements
○ Security primitives
○ ...
● Not a description of any company
(Diagram labels: Architecture, Statistics, Legal, Organisation, Security, Process, Privacy, UX, Culture)
Requirements, engineer’s perspective
● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
○ From employees
○ In case of security breach
● Right to explanation
● User data enumeration
● User data export
Ancient data-centric systems
● The monolith
● All data in one place
● Analytics + online serving from a single database
● Current state, mutable
- Please delete me?
- What data have you got on me?
- Sure, no problem!
(Diagram labels: Presentation, Logic, Storage, DB)
Event oriented systems
(Diagram: every event → all events, ever, raw, unprocessed → refinement pipeline → artifact of value)
● Motivated by
○ New types of data-driven (AI) features
○ Quicker product iterations
■ Data-driven product feedback (A/B tests)
■ Fewer teams involved in changes
○ Robustness - scales to more complex business logic
Enable disruption
Data processing at scale
(Diagram labels: Ingress: Service, DB, Import; Offline processing on cluster storage: Data lake, Dataset, Job, Pipeline; Egress: Export, Service, DB, Business intelligence)
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output is missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Recommended: Luigi / Airflow
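The run conditions above can be sketched without any particular workflow manager. This is a minimal illustration, not the Luigi or Airflow API: the `Job` class, the in-memory `store` set, and the dataset names are hypothetical, and resource availability is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """One job instance: reads input datasets, writes one output dataset."""
    output: str
    inputs: tuple = ()


def runnable(job, store):
    """A job should run when all inputs exist and its output is missing."""
    return job.output not in store and all(i in store for i in job.inputs)


def build(jobs, store):
    """Run jobs until nothing more can be produced; return execution order."""
    order = []
    progress = True
    while progress:
        progress = False
        for job in jobs:
            if runnable(job, store):
                store.add(job.output)  # stand-in for actually running the job
                order.append(job.output)
                progress = True
    return order


# Hypothetical dataset names for one day of data.
jobs = [
    Job("purchases/2017-06-13", ("users/2017-06-13", "orders/2017-06-13")),
    Job("users/2017-06-13", ("raw/users/2017-06-13",)),
    Job("orders/2017-06-13", ("raw/orders/2017-06-13",)),
]
store = {"raw/users/2017-06-13", "raw/orders/2017-06-13"}
order = build(jobs, store)
rerun = build(jobs, store)  # nothing is missing any more, so nothing runs
```

Because a job runs only when its output is missing, a backfill after a failure amounts to calling `build` again.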
Factors of success
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
- Please delete me?
- What data have you got on me?
- Hold on a second...
Solution space
● Technical feasibility
● Easy to do the right thing
● Awareness culture
Personal information (PII) classification
● Red - sensitive data
○ Messages
○ GPS location
○ Views, preferences
● Yellow - personal data
○ IDs (user, device)
○ Name, email, address
○ IP address
● Green - insensitive data
○ Not related to persons
○ Aggregated numbers
● Grey zone
○ Birth date, zip code
○ Recommendation / ads models?
Nota bene: This is only an example classification
PII arithmetics
Red + green = red
Red + yellow = red
Yellow + green = yellow
Aggregate(red/yellow) = green ?
Green + green + green = yellow ?
Yellow + yellow + yellow = red ?
Machine_learning_model(yellow) = yellow ?
Overfitting => persons could be identified
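The three unambiguous rules above amount to taking the most sensitive input level. A minimal sketch, assuming the example red/yellow/green classification; the `Pii` enum and `combine` function are hypothetical names. The question-marked cases (aggregation, accumulation, models) need human judgement and are deliberately not encoded, so the result is only a lower bound.

```python
from enum import IntEnum


class Pii(IntEnum):
    GREEN = 0   # insensitive data
    YELLOW = 1  # personal data
    RED = 2     # sensitive data


def combine(*levels):
    """Combining fields yields at least the most sensitive input level."""
    return max(levels)


red_green = combine(Pii.RED, Pii.GREEN)
red_yellow = combine(Pii.RED, Pii.YELLOW)
yellow_green = combine(Pii.YELLOW, Pii.GREEN)
```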
Make privacy visible at ground level
● In dataset names
○ hdfs://red/crm/received_messages/year=2017/month=6/day=13
○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13
● In field names
○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …”
● In credential / service / table / ... names
● Spreads awareness
● Catch mistakes in code review
● Enables custom tooling for violation warnings
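One possible form of such tooling, assuming the colour is the first component after the scheme as in the dataset names above. The function names and the output path are hypothetical. Since aggregation can legitimately lower the classification, a hit is best treated as a warning for review rather than a hard error.

```python
from urllib.parse import urlsplit

LEVELS = {"green": 0, "yellow": 1, "red": 2}


def colour(dataset_url):
    """Read the classification from the first component after the scheme."""
    name = urlsplit(dataset_url).netloc
    if name not in LEVELS:
        raise ValueError(f"no PII colour in dataset name: {dataset_url}")
    return name


def violations(inputs, output):
    """Return inputs that are more sensitive than the output location."""
    limit = LEVELS[colour(output)]
    return [url for url in inputs if LEVELS[colour(url)] > limit]


bad = violations(
    ["hdfs://red/crm/received_messages/year=2017/month=6/day=13",
     "s3://yellow/webshop/pageviews/year=2017/month=6/day=13"],
    # Hypothetical output dataset for a job joining the two inputs.
    "s3://yellow/webshop/contact_stats/year=2017/month=6/day=13",
)
```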
Eye of the needle tool
● Provide data access through gateway tool
○ Thin wrapper around Spark/Hadoop/S3/...
○ Hard-wired configuration
● Governance
○ Access audit, verification
○ Policing/retention for experiment data
● Easy to do the right thing
○ Right resource choice, e.g. “allocate temporary cluster/storage”
○ Enforce practices, e.g. run jobs from central repository code
○ No command for data download
● Enabling for data scientists
○ Empowered without operations
○ Directory of resources
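A minimal sketch of the gateway idea, assuming access is granted per PII colour. The `Gateway` class, its grants table, and the `reader` callable (a stand-in for Spark/Hadoop/S3 access) are all hypothetical. Every attempt, allowed or not, lands in the audit log, and there is deliberately no download method.

```python
import datetime


class Gateway:
    """Single entry point for dataset access; every read is audited."""

    def __init__(self, grants, reader):
        self._grants = grants  # user -> set of allowed colours (hard-wired)
        self._reader = reader  # callable standing in for Spark/Hadoop/S3
        self.audit_log = []

    def read(self, user, colour, dataset):
        allowed = colour in self._grants.get(user, set())
        now = datetime.datetime.now(datetime.timezone.utc)
        self.audit_log.append((now, user, dataset, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not read {colour} data")
        return self._reader(dataset)


gw = Gateway({"alice": {"green", "yellow"}}, reader=lambda d: f"handle:{d}")
handle = gw.read("alice", "yellow", "webshop/pageviews")
```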
Towards oblivion
● Left to its own devices, personal (PII) data spreads like weeds
● PII data needs to be governed, collared, or discarded
○ Discard what you can
Discard: Anonymisation
● Discard all PII
○ User id in example
● No link between records or datasets
● Replace with non-PII
○ E.g. age, gender, country
● Still no link
○ Beware: rare combination => not anonymised
(Diagram labels: Drop user id; Replace user id with demographics; Useful for business insights; Useful for metrics)
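Replacing the user id with demographics, and the rare-combination caveat, can be sketched as follows. Banding age into decades and suppressing combinations shared by fewer than `k` users (in the spirit of k-anonymity) are assumptions added for illustration, as are all field names and sample values.

```python
from collections import Counter


def demographics(user):
    """Coarse, non-PII replacement for the user id."""
    return (user["age"] // 10 * 10, user["gender"], user["country"])


def rare_combinations(users, k):
    """Demographic combinations shared by fewer than k users."""
    counts = Counter(demographics(u) for u in users.values())
    return {combo for combo, n in counts.items() if n < k}


def anonymise(events, users, k):
    """Drop user_id, attach demographics, suppress identifying combinations."""
    rare = rare_combinations(users, k)
    out = []
    for event in events:
        combo = demographics(users[event["user_id"]])
        if combo in rare:
            continue  # a rare combination could single out one person
        row = {f: v for f, v in event.items() if f != "user_id"}
        row["age_band"], row["gender"], row["country"] = combo
        out.append(row)
    return out


users = {
    1: {"age": 34, "gender": "F", "country": "SE"},
    2: {"age": 37, "gender": "F", "country": "SE"},
    3: {"age": 71, "gender": "M", "country": "NO"},
}
events = [{"user_id": 1, "page": "/a"},
          {"user_id": 2, "page": "/b"},
          {"user_id": 3, "page": "/a"}]
rows = anonymise(events, users, k=2)
```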
Partial discard: Pseudonymisation
● Hash PII
● Records are linked
○ Across datasets
○ Still PII, GDPR applies
○ Persons can be identified (with additional data)
○ Hash recoverable from PII
● Hash PII + salt
○ Hash not recoverable
● Records are still linked
○ Across datasets if salt is constant
(Diagram labels: Hash user id, useful for recommendations; Hash user id + salt, useful for product insights)
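The difference between the two variants can be sketched with the standard library. The slides do not name a construction, so HMAC-SHA256 is used here as one assumed way of combining a hash with a secret salt; function names and the sample user id are hypothetical.

```python
import hashlib
import hmac
import secrets


def plain_hash(user_id):
    """Unsalted: anyone who knows the user id can recompute the pseudonym."""
    return hashlib.sha256(user_id.encode()).hexdigest()


def salted_hash(user_id, salt):
    """Keyed hash: cannot be recomputed from the user id without the salt."""
    return hmac.new(salt, user_id.encode(), hashlib.sha256).hexdigest()


constant_salt = secrets.token_bytes(32)     # reused: links records across datasets
per_dataset_salt = secrets.token_bytes(32)  # separate salt: no cross-dataset link

same_everywhere = salted_hash("user-42", constant_salt)
only_in_one_dataset = salted_hash("user-42", per_dataset_salt)
```

Records within one dataset remain linked in both cases, because the same input and salt always give the same pseudonym.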
Governance: Recomputation
● Push reruns with workflow manager
- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
+ No data model changes required
Ejected record pattern
● Fields reference PII table
● Clear single record => oblivion
- PII table injection needed
- Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ Small PII leak risk
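A minimal in-memory sketch of the ejected record pattern, assuming a UUID key per user. The table, function, and field names are hypothetical; in a real pipeline the PII table would be a dataset joined in at read time.

```python
import uuid

pii_table = {}  # pii_key -> personal fields; the only place PII is stored


def register(pii):
    """Store a user's PII once and return the key other datasets refer to."""
    key = str(uuid.uuid4())
    pii_table[key] = dict(pii)
    return key


def with_pii(row):
    """The extra join: attach PII, or None if the user has been forgotten."""
    return {**row, "pii": pii_table.get(row["pii_key"])}


def forget(key):
    """Clearing the single PII record leaves every reference dangling."""
    pii_table.pop(key, None)


key = register({"name": "Ada", "email": "ada@example.com"})
orders = [{"order_id": 7, "pii_key": key}, {"order_id": 8, "pii_key": key}]
before = [with_pii(o)["pii"] for o in orders]
forget(key)
after = [with_pii(o)["pii"] for o in orders]
```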
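Record removal in pipelines
● Datasets are immutable
● Version n+1 of raw dataset lacks record
● Short retention of old versions
● Always depend on latest version
(Diagram labels: User keys 2017-06-12, User keys 2017-06-13)

from luigi import DateParameter, Task

# Users, Orders, and UserKeys are Luigi tasks defined elsewhere.
class Purchases(Task):
    date = DateParameter()

    def requires(self):
        # Depend on the latest user keys, not on a dated version.
        return [Users(self.date),
                Orders(self.date),
                UserKeys.latest()]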
Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Small PII leak risk
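A sketch of the lost key mechanics using only the standard library. The cipher below is a toy SHA-256 keystream XOR used purely so the example is self-contained; it is NOT secure and stands in for an authenticated cipher from a vetted library. The key table and function names are hypothetical.

```python
import hashlib
import secrets

user_keys = {}  # user_id -> key; deleting an entry is the act of oblivion


def _keystream(key, nonce, length):
    """Toy keystream, illustration only. Do not use for real data."""
    blocks = (hashlib.sha256(key + nonce + i.to_bytes(8, "big")).digest()
              for i in range(length // 32 + 1))
    return b"".join(blocks)[:length]


def encrypt(user_id, text):
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    data = text.encode()
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(user_id, blob):
    key = user_keys.get(user_id)  # the user id is needed to find the key
    if key is None:
        return None  # key lost: ciphertext remains in datasets, unreadable
    nonce, data = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode()


blob = encrypt("u42", "Ada Lovelace")
readable = decrypt("u42", blob)
del user_keys["u42"]             # forget the user
unreadable = decrypt("u42", blob)
```

The immutable datasets holding `blob` never change; only the small key table does.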
Lost key partial oblivion
● Different fields encrypted with different keys
● Partial user oblivion
○ E.g. forget my GPS coordinates
Lost link key
● Encrypt key fields that link datasets
● Ability to join is lost
● No data loss
○ Salt => anonymous data
○ No salt => pseudonymous data
Reversible oblivion
● Lost key pattern
● Give ejected record key to third party
○ User
○ Trusted organisation
● Destroy company copies
Data model deadly sins
● Using PII data as key
○ Username, email
● Publishing entity ids containing PII data
○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
○ They can be de-pseudonymised with external data
○ E.g. AOL, Netflix, ...
Tombstone line
● Produce dataset/stream of forgotten users
● Egress components, e.g. online service databases, may need push for removal
○ Higher PII leak risk
(Diagram labels: DB, Service)
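A minimal sketch of consuming a tombstone dataset, assuming it is a set of forgotten user ids. Batch datasets are filtered when rebuilt, while a mutable online database needs an explicit push. All names and sample rows are hypothetical.

```python
def without_forgotten(rows, forgotten):
    """Batch side: rebuild a dataset without records of forgotten users."""
    return [row for row in rows if row["user_id"] not in forgotten]


def push_removals(online_db, forgotten):
    """Egress side: delete from mutable state; return ids actually removed."""
    return sorted(uid for uid in forgotten
                  if online_db.pop(uid, None) is not None)


forgotten = {"u2"}
pageviews = [{"user_id": "u1", "page": "/a"},
             {"user_id": "u2", "page": "/b"}]
online_db = {"u1": {"name": "Ada"}, "u2": {"name": "Bob"}}

cleaned = without_forgotten(pageviews, forgotten)
removed = push_removals(online_db, forgotten)
```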
The art of deletion
● Example: Cassandra
● Deletions == tombstones
● Data remains
○ Until compaction
○ In disconnected nodes
Component-specific expertise necessary
Deletion layers
● Every component adds deletion burden
○ Minimise number of components
○ Ephemeral >> dedicated. Recycle machines.
● Every storage layer adds deletion burden
○ Minimise number of storage layers
○ Cloud storage requires documented erasure semantics + agreements.
● Invent simple strategies
○ Example: Cycle Cassandra machines regularly, erase block devices.
Increasing cost of heterogeneity
Retention limitation
● Best solved in workflow manager
○ Connect creation and destruction
● Short default retention, whitelist exceptions
● In conflict with technical ideal of immutable raw data
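The short-default, whitelisted-exception policy can be sketched as a single lookup. The 30-day default, the dataset names, and the exception values are hypothetical; in practice the check would live next to dataset creation in the workflow manager.

```python
import datetime

DEFAULT_RETENTION = datetime.timedelta(days=30)  # hypothetical short default

# Whitelisted exceptions; None means keep indefinitely.
EXCEPTIONS = {
    "green/finance/ledger": None,
    "yellow/webshop/orders": datetime.timedelta(days=365),
}


def expired(dataset, created, today):
    """True if the dataset instance is past its retention and should go."""
    retention = EXCEPTIONS.get(dataset, DEFAULT_RETENTION)
    return retention is not None and today - created > retention


today = datetime.date(2017, 6, 13)
created = datetime.date(2017, 5, 1)
pageviews_expired = expired("yellow/webshop/pageviews", created, today)
orders_expired = expired("yellow/webshop/orders", created, today)
ledger_expired = expired("green/finance/ledger", datetime.date(2010, 1, 1), today)
```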
Lake promotion
● Remove expired raw datasets, promote derived datasets to lake
● First derived dataset = washed(raw)?
● Workflow DAG still works
(Two diagrams, each labelled: Data lake, Derived, Cluster storage)
Lineage
● Tooling for tracking data flow
● Dataset granularity
○ Workflow manager?
● Field granularity
○ Framework instrumentation?
● Multiple use cases
○ (Discovering data)
○ (Pipeline change management)
○ Detecting dead end data flows
○ Right to export data
○ Explanation of model decisions
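At dataset granularity, lineage is a graph of (input, output) edges of the kind a workflow manager already knows. A minimal sketch of two of the use cases above, with hypothetical dataset names: finding everything derived from a dataset (needed for export or PII cleaning) and detecting dead-end flows.

```python
from collections import defaultdict, deque


def downstream(edges, source):
    """All datasets derived, directly or indirectly, from source."""
    children = defaultdict(set)
    for inp, out in edges:
        children[inp].add(out)
    seen, queue = set(), deque([source])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def dead_ends(edges, consumed):
    """Outputs that no job reads and that no egress consumes."""
    inputs = {inp for inp, _ in edges}
    outputs = {out for _, out in edges}
    return outputs - inputs - consumed


edges = [("raw/pageviews", "sessions"),
         ("sessions", "recommendations"),
         ("sessions", "debug_dump"),
         ("raw/orders", "revenue")]
derived = downstream(edges, "raw/pageviews")
unused = dead_ends(edges, consumed={"recommendations", "revenue"})
```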
Solicitation: PII & lineage type systems
● Idea: decorate (Scala) types
○ PII classification (red/yellow/green)
○ Lineage (e.g. processing class id + commit id + dataset revision)
● Assistance with PII arithmetics
○ PII[Red, String] + PII[Green, String] => PII[Red, String]
○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int]
● Detect unused PII fields
● Assist with recomputation
○ For PII cleaning
○ Bug fixes
Resources
● http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
● http://www.mapflat.com/lands/resources/reading-list
● https://ico.org.uk/
● EU Article 29 Working Party
Credits
● Alexander Kjeldaas, independent
● Lena Sundin, Spotify
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen, Sharp Cookie Advisors
● Øyvind Løkling, Schibsted Media Group

Protecting privacy in practice

  • 1. Protecting privacy in practice 2017-06-13 Lars Albertsson www.mapflat.com 1
  • 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) 2
  • 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  • 4. Privacy-driven design ● Technical scope ○ Toolbox ○ Not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Not description of any company 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  • 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Right for explanations ● User data enumeration ● User data export 5
  • 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Sure, no problem! 6 DB Presentation Logic Storage
  • 7. Event oriented systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value
  • 8. Event oriented systems 8 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  • 9. v Service Data processing at scale 9 Cluster storage Ingress Offline processing Egress Data lake DB Service DatasetJob Pipeline Service Export Business intelligence DB DB Import
  • 10. Workflow manager 10 ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ● DSL describes DAG ○ Includes ingress & egress Recommended: Luigi / Airflow DB
  • 11. Factors of success 11 ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Hold on a second...
  • 12. Solution space 12 Technical feasibility Easy to do the right thing Awareness culture
  • 13. Personal information (PII) classification ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers 13 ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models? Nota bene: This is only an example classification
  • 14. PII arithmetics Red + green = red red + yellow = red yellow + green = yellow Aggregate(red/yellow) = green ? Green + green + green = yellow ? Yellow + yellow + yellow = red ? Machine_learning_model(yellow) = yellow ? Overfitting => persons could be identified 14
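The three unambiguous rules on this slide amount to "a combination is as sensitive as its most sensitive part". A minimal Python sketch, assuming the example red/yellow/green classification from the previous slide (`Pii` and `combine` are hypothetical names):

```python
from enum import IntEnum


class Pii(IntEnum):
    """Example classification, ordered by sensitivity."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def combine(*levels):
    """Combining fields yields the most sensitive level among them."""
    return max(levels)
```

The rules marked with a question mark (aggregation, several green or yellow fields adding up to something more sensitive, trained models) are not captured by `max` and still need human judgement.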
  • 15. Make privacy visible at ground level ● In dataset names ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13 ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13 ● In field names ○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …” ● In credential / service / table / ... names ● Spreads awareness ● Catch mistakes in code review ● Enables custom tooling for violation warnings 15
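As one example of the custom tooling this naming scheme enables, a check could read the classification out of the dataset path and warn when a job writes to a less sensitive location than it reads from. The sketch below assumes the colour is the first component after the scheme, as in the example paths; `level_of` and `violations` are hypothetical names.

```python
from urllib.parse import urlparse

LEVELS = {"green": 0, "yellow": 1, "red": 2}


def level_of(path):
    """Read the classification from the first path component after the scheme."""
    colour = urlparse(path).netloc
    if colour not in LEVELS:
        raise ValueError(f"no PII classification in dataset path: {path}")
    return LEVELS[colour]


def violations(inputs, output):
    """List inputs that are more sensitive than the output they feed."""
    return [p for p in inputs if level_of(p) > level_of(output)]


red = "hdfs://red/crm/received_messages/year=2017/month=6/day=13"
yellow = "s3://yellow/webshop/pageviews/year=2017/month=6/day=13"
suspicious = violations([red, yellow], yellow)
```

A hit is best treated as a warning for code review rather than a hard failure, since an aggregation or anonymisation job may legitimately lower the classification.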
  • 16. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 16
  • 17. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 17
  • 18. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 18
  • 19. Discard: Anonymisation ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised. Diagram labels: Drop user id; Replace user id with demographics; Useful for business insights; Useful for metrics
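The warning about rare combinations can be made concrete with a small check run after the user id has been replaced by demographics. This is a sketch under assumptions: the field names, the `anonymise` and `rare_combinations` helpers, and the threshold `k` are all hypothetical, and a suitable threshold depends on the data.

```python
from collections import Counter


def anonymise(records, keep=("age", "gender", "country")):
    """Drop user_id and everything else except coarse demographics."""
    return [tuple(r[k] for k in keep) for r in records]


def rare_combinations(rows, k=2):
    """Return demographic combinations shared by fewer than k records."""
    counts = Counter(rows)
    return {combo for combo, n in counts.items() if n < k}


records = [
    {"user_id": "u1", "age": 30, "gender": "f", "country": "SE"},
    {"user_id": "u2", "age": 30, "gender": "f", "country": "SE"},
    {"user_id": "u3", "age": 85, "gender": "m", "country": "IS"},
]
rows = anonymise(records)
rare = rare_combinations(rows, k=2)
```

Any combination reported here describes so few people that the record may still point to an individual, so it would need coarser buckets or removal before the dataset can be treated as anonymised.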
  • 20. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 20 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
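The difference between the two variants can be illustrated with the Python standard library. In this sketch the "salt" is treated as a secret key for HMAC-SHA-256, which is one common way to realise "hash + salt"; the slide does not prescribe a construction. `pseudonym` and `recover` are hypothetical helper names.

```python
import hashlib
import hmac
import secrets


def pseudonym(user_id, salt=b""):
    """SHA-256 of the user id, keyed with the salt via HMAC when one is given."""
    if not salt:
        return hashlib.sha256(user_id.encode()).hexdigest()
    return hmac.new(salt, user_id.encode(), hashlib.sha256).hexdigest()


def recover(hashed, candidate_ids):
    """Dictionary attack on an unsalted hash: re-hash known ids and compare."""
    for uid in candidate_ids:
        if pseudonym(uid) == hashed:
            return uid
    return None


salt = secrets.token_bytes(32)          # must be kept secret
known_ids = ["user-1", "user-2"]

unsalted = pseudonym("user-2")
salted = pseudonym("user-2", salt)
```

Anyone holding a list of user ids can reverse `unsalted` with `recover`, while `salted` resists that as long as the salt stays secret. Using the same salt for every dataset keeps records joinable across datasets; a different salt per dataset breaks the link.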
  • 21. Governance: Recomputation ● Push reruns with workflow manager - No versioning support in tools - Computationally expensive - Easy to miss datasets + No data model changes required
  • 22. Ejected record pattern ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + Small PII leak risk
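A minimal in-memory sketch of the ejected record pattern, assuming UUID keys: events carry only an opaque key, the personal fields live in a separate table, and clearing one row makes every event that references it anonymous. `PiiTable`, `join` and the field names are hypothetical.

```python
import uuid


class PiiTable:
    """Holds the ejected personal fields, keyed by an opaque UUID."""

    def __init__(self):
        self._rows = {}

    def eject(self, pii):
        """Store the PII record and return the key to embed in events."""
        key = str(uuid.uuid4())
        self._rows[key] = pii
        return key

    def lookup(self, key):
        return self._rows.get(key)

    def forget(self, key):
        """Clear a single record; events referencing it lose their PII."""
        self._rows.pop(key, None)


def join(events, table):
    """The extra join: attach PII where it still exists, else None."""
    return [{**e, "pii": table.lookup(e["pii_key"])} for e in events]


table = PiiTable()
key = table.eject({"name": "Alice", "email": "alice@example.com"})
events = [{"action": "purchase", "pii_key": key}]

before = join(events, table)
table.forget(key)
after = join(events, table)
```

The events themselves are never rewritten, which fits immutable datasets, at the cost of the join and of having to route all PII through the table at ingestion time.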
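  • 23. Record removal in pipelines ● Datasets are immutable ● Version n+1 of raw dataset lacks record ● Short retention of old versions ● Always depend on latest version. Diagram labels: User keys 2017-06-12, User keys 2017-06-13. Luigi task from the slide:

        from luigi import DateParameter, Task

        # Users, Orders and UserKeys are other tasks, defined elsewhere.
        class Purchases(Task):
            date = DateParameter()

            def requires(self):
                # Depend on the latest UserKeys version, not on a dated one.
                return [Users(self.date), Orders(self.date), UserKeys.latest()]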
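The effect of always depending on the latest version can be shown without Luigi. In this plain-Python sketch, each version of the user keys dataset is an immutable set, and a user removed in the newest version drops out of everything rebuilt afterwards. `latest`, `purchases` and the sample data are hypothetical.

```python
import datetime


def latest(versions):
    """Pick the newest version of a dataset keyed by date."""
    return versions[max(versions)]


def purchases(orders, user_key_versions):
    """Join orders against the latest user keys; forgotten users drop out."""
    keys = latest(user_key_versions)
    return [o for o in orders if o["user_id"] in keys]


user_keys = {
    datetime.date(2017, 6, 12): frozenset({"u1", "u2"}),
    datetime.date(2017, 6, 13): frozenset({"u1"}),      # u2 asked to be forgotten
}
orders = [
    {"user_id": "u1", "item": "book"},
    {"user_id": "u2", "item": "lamp"},
]
result = purchases(orders, user_keys)
```

Older versions still contain the removed user, which is why the slide pairs this with short retention of old versions.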
  • 24. Lost key pattern ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Decryption (user) id needed + Multi-field oblivion + Small PII leak risk
  • 25. Lost key partial oblivion ● Different fields encrypted with different keys ● Partial user oblivion ○ E.g. forget my GPS coordinates
  • 26. Lost link key ● Encrypt key fields that link datasets ● Ability to join is lost ● No data loss ○ Salt => anonymous data ○ No salt => pseudonymous data
  • 27. Reversible oblivion ● Lost key pattern ● Give ejected record key to third party ○ User ○ Trusted organisation ● Destroy company copies 27
  • 28. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ... 28
  • 29. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk 29 DB Service
  • 30. The art of deletion ● Example: Cassandra ● Deletions == tombstones ● Data remains ○ Until compaction ○ In disconnected nodes Component-specific expertise necessary 30
  • 31. Deletion layers ● Every component adds deletion burden ○ Minimise number of components ○ Ephemeral >> dedicated. Recycle machines. ● Every storage layer adds deletion burden ○ Minimise number of storage layers ○ Cloud storage requires documented erasure semantics + agreements. ● Invent simple strategies ○ Example: Cycle Cassandra machines regularly, erase block devices. Increasing cost of heterogeneity 31
  • 32. Retention limitation ● Best solved in workflow manager ○ Connect creation and destruction ● Short default retention, whitelist exceptions ● In conflict with technical ideal of immutable raw data 32
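"Short default retention, whitelist exceptions" can be expressed as a small policy function that a workflow manager consults when cleaning up. The sketch below is an assumption about how such a policy might look; the 30-day default, the whitelist entry and the dataset names are invented for illustration.

```python
import datetime

DEFAULT_RETENTION = datetime.timedelta(days=30)      # assumed short default
EXCEPTIONS = {                                        # hypothetical whitelist
    "green/finance/revenue": datetime.timedelta(days=3650),
}


def expired(dataset, created, today):
    """True when a dataset partition has outlived its retention period."""
    retention = EXCEPTIONS.get(dataset, DEFAULT_RETENTION)
    return today - created > retention


today = datetime.date(2017, 6, 13)
created = datetime.date(2017, 5, 1)
drop_messages = expired("red/crm/received_messages", created, today)
drop_revenue = expired("green/finance/revenue", created, today)
```

Keeping the exceptions in one reviewed list means that long retention is an explicit decision rather than the result of nobody cleaning up.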
  • 33. Lake promotion ● Remove expired raw datasets, promote derived datasets to lake ● First derived dataset = washed(raw)? ● Workflow DAG still works. Diagram labels: Data lake, Derived, Cluster storage
  • 34. Lineage ● Tooling for tracking data flow ● Dataset granularity ○ Workflow manager? ● Field granularity ○ Framework instrumentation? ● Multiple use cases ○ (Discovering data) ○ (Pipeline change management) ○ Detecting dead end data flows ○ Right to export data ○ Explanation of model decisions 34
  • 35. Solicitation: PII & lineage type systems ● Idea: decorate (scala) types ○ PII classification (red/yellow/green) ○ Lineage (e.g. processing class id + commit id + dataset revision) ● Assistance with PII arithmetics ○ PII[Red, String] + PII[Green, String] => PII[Red, String] ○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int] ● Detect unused PII fields ● Assist with recomputation ○ For PII cleaning ○ Bug fixes 35
  • 36. Resources ● http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid ● http://www.mapflat.com/lands/resources/reading-list ● https://ico.org.uk/ ● EU Article 29 Working Party. Credits ● Alexander Kjeldaas, independent ● Lena Sundin, Spotify ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group