Balancing Data Democracy with Data
Privacy: The LinkedIn Story
Jan 25, 2018
Eric Ogren
Anthony Hsu
Big Data Meetup, LinkedIn SF
1
We needed data democracy to
deliver member value
LinkedIn Data Science
I want to analyze as much data as
possible so my models are accurate
Data Democracy
ALL THE DATA, ALL THE TIME
I want to discover data that’s needed for my
analysis as fast as possible
I want to access that data as quickly as
possible for my analysis

2
I want my personal data to be stored only
where needed and not propagated
unnecessarily
Data Protection
Need to Ensure Member Privacy
LinkedIn Members
STORE, PROCESS, DELETE, …
I want my personal data to be deleted when
I close my account or request deletion
I want my personal data to only be
processed if essential and only if I consent
3
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
4
Data Hubs at LinkedIn
In Motion
Scale: O(10) clusters, ~2.3 trillion messages / day, ~450 TB written / day
At Rest
Scale: O(10) clusters, ~10K machines, ~XXX PB at rest
6
In Motion
At Rest
Data Integration
SFTP
JDBC
REST
Azure Blob, Data Lake Storage
7
REQUIREMENTS
Less Data
Legal: Right to Erasure or Right to be Forgotten
“Delete all my personal data without undue delay when it is no
longer necessary / when consent has been withdrawn”
Engineering:
Need the ability to delete some specific subset or all data associated
with a specific LinkedIn member from all our data systems
8
A lot of data, different formats
Challenges
Understand HDFS data: organization, formats, …
Cycle asynchronously, within an SLA, deleting
records, without affecting running jobs
Quarantine exceptional records for manual triage
Can scale to processing hundreds of PB of data
Data Deletion
IMPLICATIONS FOR HADOOP
9
Gobblin: The Logical Pipeline
[Diagram: a Source is split into Work Units; each Work Unit runs as a Task through the stages Extract -> Convert -> Quality -> Write; task output goes to Data Publish.]
10
Gobblin: Extending for Purge
[Diagram: HDFS -> Work Units -> Tasks (Extract -> Convert -> Quality -> Write) -> Data Publish -> HDFS. Member's Delete Requests feed the tasks: if a record needs purge then drop, else continue.]
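The "if needs purge then drop, else continue" step can be sketched as a per-record filter. This is only an illustration, not Gobblin's API: the function name `purge`, the `memberId` key, and the in-memory set of delete requests are all assumptions. Records without a member key are set aside, echoing the earlier requirement to quarantine exceptional records for manual triage.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, object]


def purge(
    records: Iterable[Record],
    delete_requests: Set[int],
    member_key: str = "memberId",  # hypothetical field name
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, quarantined), dropping purged members."""
    kept: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        member_id = record.get(member_key)
        if member_id is None:
            # Cannot decide without a member key: hold for manual triage.
            quarantined.append(record)
        elif member_id in delete_requests:
            # Needs purge: drop the record.
            continue
        else:
            # Else continue down the pipeline.
            kept.append(record)
    return kept, quarantined


records = [
    {"memberId": 1, "event": "view"},
    {"memberId": 2, "event": "click"},
    {"event": "orphan"},
    {"memberId": 3, "event": "view"},
]
kept, quarantined = purge(records, delete_requests={2})
```

At the scale described in this deck, the delete requests would not fit a simple in-memory set per task; that part is simplified here.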
11
STATUS AND CHALLENGES
Gobblin: Data Lifecycle Management at Scale
Status
Number of datasets: many thousands
Amount of data scanned for purge: hundreds of TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
12
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
13
LinkedIn’s Data Ecosystem
15
Metadata based Search Experience
for Data Scientists
Data Discovery
Where is dataset X?
How did it get created?
Usage : In production since 2014
Users : Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis
WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS
Open source @ github.com/linkedin/wherehows
16
SEARCH SCREENSHOTS
WhereHows
17
LINEAGE SCREENSHOTS
WhereHows
18
More than just Discovery
Use Cases
Which datasets at LinkedIn contain PII or highly
confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
Discovering Violations
ANSWERING HARDER QUESTIONS
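Once metadata is treated like data, the harder questions above become simple queries. The sketch below is not the WhereHows data model: the `DatasetMetadata` fields, the tag names, and the 30-day SLA are all hypothetical, chosen only to show the shape of such queries.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set


@dataclass
class DatasetMetadata:
    # All fields are illustrative assumptions, not a real catalog schema.
    name: str
    tags: Set[str] = field(default_factory=set)
    readers: Set[str] = field(default_factory=set)
    last_purged: date = date.min


def with_tag(catalog: List[DatasetMetadata], tag: str) -> List[str]:
    """Which datasets carry a given tag (e.g. 'pii')?"""
    return [d.name for d in catalog if tag in d.tags]


def accessible_by(catalog: List[DatasetMetadata], team: str, tag: str) -> List[str]:
    """Which tagged datasets can a given team read?"""
    return [d.name for d in catalog if tag in d.tags and team in d.readers]


def purge_sla_violations(
    catalog: List[DatasetMetadata], today: date, sla_days: int
) -> List[str]:
    """Which datasets were last purged longer ago than the SLA allows?"""
    return [d.name for d in catalog if (today - d.last_purged).days > sla_days]


catalog = [
    DatasetMetadata("member_profile", {"pii"}, {"team_x"}, date(2018, 1, 20)),
    DatasetMetadata("page_views", set(), {"team_x", "team_y"}, date(2018, 1, 24)),
    DatasetMetadata("messages", {"pii", "member_messages"}, {"team_y"}, date(2017, 11, 1)),
]
today = date(2018, 1, 25)
```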
19
Wide + Deep
Metadata
Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, …
Deeper understanding of each dataset
Schema is not enough
Need to understand semantics
Discovering Violations
REQUIREMENTS
20
A METADATA REFINERY APPROACH
WhereHows Architecture @ 10,000 ft
ML driven
refinements
21
METADATA SHOULD LOOK JUST LIKE DATA
WhereHows Architecture @ 10,000 ft
ML driven
refinements
Unified Metadata Dataset
Metadata Serving Repository
key-value
search
graph
Data Systems
Technical metadata
Snapshots
Stream
Services + Jobs
Operational
Metadata
WhereHows
Application
LinkedIn
Community Annotations
Technical metadata
Data Catalogs
Process Definitions
Code
Operational metadata
Data Publish
Data Access
Job Executions
22
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
23
METADATA
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
24
Simple to Complex
Different Types
Basic Restrictions
Access to dataset based on business need
Privacy by Default
Analysts shouldn’t get access to raw PII
(Personally Identifiable Information) by default
Consent-based Access
Access to certain data elements is only available
if the member has consented to that particular use case
Access Restrictions
REQUIREMENTS
25
FREEDOM OF EXPRESSION
Many Transformation Engines @ LinkedIn
In Motion
At Rest
26
HARD TO CHANGE ANYTHING UNDERNEATH!
Challenge for Infrastructure Providers
(Pig scripts)
My Raw Data
Native readers, dependencies on path, format hard-coded
Hard to move to
better formats
without breaking
everyone or
copying data twice
My Raw Data
27
HARD TO CHANGE ANYTHING UPSTREAM!
Semantic Challenges
Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change
data processing
logic everywhere!
My Raw Data
28
AN API TO MANAGE EVOLUTION
We need “microservices” for Data
My Data API
My Raw Data
29
A DATA ACCESS LAYER FOR LINKEDIN
We built Dali to solve this
Dataset Readers
Dataset Tooling
Abstract away underlying physical details to
allow users to focus solely on the logical
concerns
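A minimal sketch of the idea, assuming a catalog that maps a logical dataset name to its current physical format and storage. This is not Dali's API: `DatasetCatalog`, `register`, `read`, and the in-memory string payloads standing in for files are all hypothetical. The point is that consumer code names only the logical dataset, so the owner can change the format underneath without breaking readers.

```python
import csv
import io
import json
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def read_json_lines(payload: str) -> List[Row]:
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def read_csv(payload: str) -> List[Row]:
    return [dict(row) for row in csv.DictReader(io.StringIO(payload))]


# Physical formats the catalog knows how to read.
READERS: Dict[str, Callable[[str], List[Row]]] = {
    "jsonl": read_json_lines,
    "csv": read_csv,
}


class DatasetCatalog:
    """Maps a logical dataset name to (format, payload)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}

    def register(self, logical_name: str, fmt: str, payload: str) -> None:
        self._entries[logical_name] = (fmt, payload)

    def read(self, logical_name: str) -> List[Row]:
        fmt, payload = self._entries[logical_name]
        return READERS[fmt](payload)


def member_names(catalog: DatasetCatalog) -> List[str]:
    # Consumer logic: no path and no format appear here.
    return [row["name"] for row in catalog.read("member_profile")]


catalog = DatasetCatalog()
catalog.register(
    "member_profile",
    "jsonl",
    '{"memberId": "1", "name": "Ada"}\n{"memberId": "2", "name": "Lin"}',
)
before = member_names(catalog)

# The dataset owner migrates to another format; consumers are untouched.
catalog.register("member_profile", "csv", "memberId,name\n1,Ada\n2,Lin\n")
after = member_names(catalog)
```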
30
Dali: Implementation Details in Context
Dataflow APIs
(MR, Spark,
Scalding)
Query Layers
(Pig, Hive,
Spark)
Data Catalog
Git + Artifactory
Dataset
Owner
31
Datasets
+
UDFs
Dali Datasets (Tables+Views)
Dali Readers
STEP 1: DATA + METADATA
Solving for Compliant Access
Schema = {
int memberId
String firstName
String lastName
Position[] positions
educationHistory[] educationHistory
…
}
MemberProfile
NAME : is_pii
MEMBER_ID : is_pii
Raw
Dataset
Meta
Data
32
STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
33
A BITMAP DATASET: ONE PER MEMBER PER SETTING
Privacy Preferences
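One way to read "one per member per setting" is one bit per member per privacy setting. The sketch below packs a member's settings into an integer bitmap under that assumption; the setting names and the helpers `encode` and `has_consented` are hypothetical and not taken from the talk.

```python
from typing import Dict, List

# Hypothetical privacy settings; the position in the list is the bit index.
SETTINGS: List[str] = ["ads_personalization", "research", "recommendations"]
BIT: Dict[str, int] = {name: 1 << i for i, name in enumerate(SETTINGS)}


def encode(consents: Dict[str, bool]) -> int:
    """Pack a member's per-setting consents into one integer bitmap."""
    bitmap = 0
    for setting, allowed in consents.items():
        if allowed:
            bitmap |= BIT[setting]
    return bitmap


def has_consented(bitmap: int, setting: str) -> bool:
    """Check a single setting bit."""
    return bool(bitmap & BIT[setting])


# memberId -> bitmap
preferences: Dict[int, int] = {
    101: encode({"ads_personalization": True, "research": False, "recommendations": True}),
    102: encode({"ads_personalization": False, "research": False, "recommendations": False}),
}
```

A compact encoding like this keeps the preference dataset small enough to join against large datasets cheaply, which is presumably why a bitmap was chosen.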
34
Member Privacy
Preferences
Solving for Compliant Access With Dali
Raw
Dataset
Meta
Data
Member Privacy
Preferences
Dali Reader responsibility:
Given:
(Dataset, Metadata, UseCase)
Generate:
Dataset and Column-level
transformations
(obfuscate, null, …)
Auto-join with Member
Privacy Preferences
(filter out data elements that
are not consented to)
Processing
Logic
Dali
Reader
Library
Use
Case = X
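The reader responsibility above can be sketched as a function from (dataset, metadata, use case) to transformed rows. This is an assumption-laden illustration, not the Dali Reader Library: `compliant_read`, the policy format, and the truncated SHA-256 used for obfuscation are all hypothetical. For brevity, consent is applied per member (whole rows) rather than per data element, and PII columns with no explicit policy are nulled, in the spirit of privacy by default.

```python
import hashlib
from typing import Dict, List, Optional, Set

Row = Dict[str, object]


def obfuscate(value: object) -> str:
    """Replace a value with a short one-way digest (illustrative only)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def compliant_read(
    rows: List[Row],
    pii_columns: Set[str],            # from dataset metadata (is_pii)
    policy: Dict[str, str],           # per use case: column -> 'obfuscate' | 'null' | 'keep'
    consented_members: Set[int],      # from member privacy preferences
    member_key: str = "memberId",
) -> List[Row]:
    result: List[Row] = []
    for row in rows:
        # "Auto-join" with preferences: skip members without consent.
        if row[member_key] not in consented_members:
            continue
        clean: Row = {}
        for column, value in row.items():
            if column not in pii_columns:
                clean[column] = value
                continue
            action = policy.get(column, "null")  # privacy by default
            transformed: Optional[object]
            if action == "obfuscate":
                transformed = obfuscate(value)
            elif action == "keep":
                transformed = value
            else:
                transformed = None
            clean[column] = transformed
        result.append(clean)
    return result


rows = [
    {"memberId": 1, "firstName": "Ada", "industry": "Software"},
    {"memberId": 2, "firstName": "Lin", "industry": "Finance"},
]
out = compliant_read(
    rows,
    pii_columns={"memberId", "firstName"},
    policy={"memberId": "obfuscate"},
    consented_members={1},
)
```

Because the transformation lives in the reader, processing logic for use case X never sees raw PII unless the policy for that use case explicitly keeps it.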
36
Compliance Transformations: Under the Hood
37
[Diagram: the original query plan (Table Scan Operator -> Filter Operator -> Select Operator) is rewritten to include a GDPR Operator, which takes Metadata, Query Context, and Privacy Settings as inputs.]
Solving for Compliant Purging With Dali + Gobblin
Raw
Dataset
Meta
Data
Member Privacy
Preferences
Gobblin
Purger
Dali
Reader
Library
Use
Case =
Purge
Purged
Dataset
Member Delete
Requests
38
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
DATA ACCESS LAYER
39
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox: Solved!
METADATA
DATA ACCESS LAYER
DATA LIFECYCLE MANAGEMENT
40
DATA DEMOCRACY + DATA PROTECTION
The Technology Blueprint
METADATA: WhereHows*
DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
* Open Source: We can collaborate on these together!
41
Thank You!
42

Balancing Data Democracy with Data Privacy: The LinkedIn Story

  • 1. Balancing Data Democracy with Data Privacy: The LinkedIn Story Jan 25, 2018 Eric Ogren Anthony Hsu Big Data Meetup, LinkedIn SF 1
  • 2. We needed data democracy to deliver member value LinkedIn Data Science I want to analyze as much data as possible so my models are accurate Data Democracy ALL THE DATA, ALL THE TIME I want to discover data that’s needed for my analysis as fast as possible I want to access that data as quickly as possible for my analysis
 2
  • 3. I want my personal data to be stored only where needed and not propagated unnecessarily Data Protection Need to Ensure Member Privacy LinkedIn Members STORE, PROCESS, DELETE,.. I want my personal data to be deleted when I close my account or request deletion I want my personal data to only be processed if essential and only if I consent 3
  • 4. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox 4
  • 5. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox 5
  • 6. Data Hubs at LinkedIn In Motion At Rest Scale O(10) clusters ~2.3 Trillion messages / day ~450 TB written / day Scale O(10) clusters ~10K machines ~XXX PB at rest 6
  • 7. In Motion At Rest Data Integration SFTP JDBC REST Azure Blob, Data Lake Storage 7
  • 8. REQUIREMENTS Less Data Legal: Right to Erasure or Right to be Forgotten “Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn” Engineering: Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems 8
  • 9. A lot of data, different formats Challenges Understand HDFS data: organization, formats, … Cycle asynchronously, within an SLA, deleting records, without affecting running jobs Quarantine exceptional records for manual triage Can scale to processing hundreds of PB of data Data Deletion IMPLICATIONS FOR HADOOP 9
  • 10. Gobblin: The Logical Pipeline Source Work Unit Work Unit Work Unit Extract Convert Quality Write Data Publish WriteQualityConvertExtract Extract Convert Quality Write Task Task Task 10
  • 11. Gobblin: Extending for Purge HDFS Work Unit Data Publish Extract Convert Quality Write Task Task HDFS If needs purge then drop else continue Member’s Delete Requests 11
  • 12. STATUS AND CHALLENGES Gobblin: Data Lifecycle Management at Scale Status Number of datasets: many thousands Amount of data scanned for purge: hundreds of TB/day Challenges Immutable Storage Formats +  Right to Erasure = Unhappy Disks “Widespread implementation will surely lead to innovation in these formats!” 12
  • 13. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT 13
  • 14. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT 14
  • 16. Metadata based Search Experience for Data Scientists Data Discovery Where is dataset X? How did it get created? Usage : In production since 2014 Users : Data Scientists, Product Engineers Use Cases: Discovery, Impact Analysis WhereHows FIND DATA, NAVIGATE RELATIONSHIPS Open source @ github.com/linkedin/wherehows 16
  • 19. More than just Discovery Use Cases Which datasets at LinkedIn contain PII or highly confidential data? How many contain member-member messages? How many of them are accessible by team X? Have all datasets been purged within SLA? Discovering Violations ANSWERING HARDER QUESTIONS 19
  • 20. Wide + Deep Metadata Comprehensive coverage of data systems at LinkedIn We have > 20 systems! SQL, NoSQL, Indexes, Blob Stores, … Deeper understanding of each dataset Schema is not enough Need to understand semantics Discovering Violations REQUIREMENTS 20
  • 21. WhereHows Architecture @ 10,000 ft: A Metadata Refinery Approach. [Diagram label: ML-driven refinements]
  • 22. WhereHows Architecture @ 10,000 ft: Metadata Should Look Just Like Data. [Diagram labels: Data Systems; Data Catalogs; Services + Jobs; Process Definitions; Code; LinkedIn Community (annotations); Technical metadata (snapshots, stream); Operational metadata (data publish, data access, job executions); ML-driven refinements; Unified Metadata; Dataset Metadata; Metadata Serving Repository (key-value, search, graph); WhereHows Application]
  • 23. The Data Paradox: DATA DEMOCRACY <> DATA PROTECTION. More Data vs. Less Data; Discover Data vs. Discover Violations; Easy Access vs. Restricted Access. Highlighted: DATA LIFECYCLE MANAGEMENT, METADATA.
  • 25. Access Restrictions: Requirements. Different types, simple to complex. Basic restrictions: access to a dataset based on business need. Privacy by default: analysts shouldn’t get access to raw PII (Personally Identifiable Information) by default. Consent-based access: access to certain data elements is only available if the member has consented for that particular use case.
  • 26. Many Transformation Engines @ LinkedIn: Freedom of Expression. [Diagram: engines for data In Motion and At Rest]
  • 27. Challenge for Infrastructure Providers: Hard to Change Anything Underneath! Consumers (Pig scripts) read My Raw Data through native readers, with dependencies on path and format hard-coded. It is hard to move to better formats without breaking everyone or copying data twice.
  • 28. Semantic Challenges: Hard to Change Anything Upstream! Data is unclean (bad data on certain dates). Data models are in constant flux (split event into multiple). Have to change data processing logic everywhere!
  • 29. We need “microservices” for Data: An API to Manage Evolution. [Diagram: My Data API in front of My Raw Data]
  • 30. We built Dali to solve this: A Data Access Layer for LinkedIn. Dataset readers and dataset tooling abstract away underlying physical details to allow users to focus solely on the logical concerns.
  • 31. Dali: Implementation Details in Context. [Diagram labels: Dataflow APIs (MR, Spark, Scalding); Query Layers (Pig, Hive, Spark); Dali Readers; Dali Datasets (Tables + Views); Datasets + UDFs; Data Catalog; Git + Artifactory; Dataset Owner]
  • 32. Solving for Compliant Access, Step 1: Data + Metadata. Raw dataset MemberProfile with Schema = { int memberId; String firstName; String lastName; Position[] positions; educationHistory[] educationHistory; … }. Metadata: NAME : is_pii; MEMBER_ID : is_pii.
  • 33. Privacy Preferences, Step 2: A Member’s Preferences.
  • 35. Privacy Preferences: A Bitmap Dataset, One per Member per Setting. [Diagram: Member Privacy Preferences]
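One way to picture such a bitmap dataset is one bit per member per setting, packed into a single integer per member. The slides do not show the actual layout, so this Python sketch is an assumption throughout: the `Setting` names, the dict keyed by member ID, and the rule that a missing member means "no consent" are all illustrative.

```python
from enum import IntFlag
from typing import Dict


class Setting(IntFlag):
    """Hypothetical privacy settings; each one occupies a single bit."""
    ADS_PERSONALIZATION = 1 << 0
    RESEARCH_USE = 1 << 1
    THIRD_PARTY_SHARING = 1 << 2


def set_consent(prefs: Dict[int, int], member_id: int, setting: Setting, consented: bool) -> None:
    current = prefs.get(member_id, 0)
    if consented:
        prefs[member_id] = current | int(setting)     # switch the bit on
    else:
        prefs[member_id] = current & ~int(setting)    # switch the bit off


def has_consented(prefs: Dict[int, int], member_id: int, setting: Setting) -> bool:
    # Members with no row are treated as not consenting (privacy by default).
    return (prefs.get(member_id, 0) & int(setting)) == int(setting)


prefs: Dict[int, int] = {}
set_consent(prefs, 7, Setting.ADS_PERSONALIZATION, True)
set_consent(prefs, 7, Setting.RESEARCH_USE, True)
set_consent(prefs, 7, Setting.ADS_PERSONALIZATION, False)   # member 7 withdraws one consent
```

A compact layout like this keeps the preferences dataset small enough to join against every read, which is what a consent-based access check needs.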
  • 36. Solving for Compliant Access With Dali. Inputs: Raw Dataset, Metadata, Member Privacy Preferences, and Use Case = X, consumed by processing logic through the Dali Reader Library. Dali Reader responsibility: given (Dataset, Metadata, UseCase), generate dataset- and column-level transformations (obfuscate, null, …) and auto-join with Member Privacy Preferences (filter out data elements that are not consented to).
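The reader's responsibility can be sketched as a function of (dataset, metadata, use case). This is a simplified Python illustration, not Dali's API: `read_compliant`, the `USE_CASE_POLICY` table, and the use-case names are hypothetical, and for brevity it drops whole rows for members who have not consented rather than filtering individual data elements.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, Set


def obfuscate(value: object) -> str:
    # One-way hash so values stay joinable without exposing the raw PII.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def null_out(value: object) -> None:
    return None


# Column-level metadata, in the spirit of "NAME : is_pii, MEMBER_ID : is_pii".
PII_COLUMNS: Set[str] = {"memberId", "firstName", "lastName"}

# Hypothetical policy: which transformation each use case applies to PII columns.
USE_CASE_POLICY: Dict[str, Callable[[object], object]] = {
    "analytics": obfuscate,
    "export": null_out,
}


def read_compliant(rows: Iterable[dict], pii_columns: Set[str],
                   consent: Dict[str, Set[int]], use_case: str) -> Iterator[dict]:
    transform = USE_CASE_POLICY[use_case]
    allowed = consent.get(use_case, set())
    for row in rows:
        # "Auto-join" with privacy preferences: skip members without consent.
        if row["memberId"] not in allowed:
            continue
        yield {col: transform(val) if col in pii_columns else val
               for col, val in row.items()}


rows = [
    {"memberId": 1, "firstName": "Ada", "title": "Engineer"},
    {"memberId": 2, "firstName": "Bob", "title": "Analyst"},
]
consent = {"analytics": {1}}
analytics_view = list(read_compliant(rows, PII_COLUMNS, consent, "analytics"))
export_view = list(read_compliant(rows, PII_COLUMNS, consent, "export"))
```

Because the policy lives in the reader rather than in each job, a change in the transformation for a use case would not require touching the processing logic that consumes the rows.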
  • 37. Compliance Transformations: Under the Hood. [Diagram: the original plan (Table Scan Operator, Filter Operator, Select Operator) next to the rewritten plan, which adds a GDPR Operator fed by Metadata, Query Context, and Privacy Settings]
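The plan rewrite can be sketched as inserting one extra operator into an operator chain. The Python below is a toy model, not the real query engine: the plan representation, the operator helpers, and in particular the choice to place the GDPR operator directly after the table scan are assumptions, since the slide text does not say where in the plan it sits.

```python
from typing import Callable, Iterable, Iterator, List, Set, Tuple

Row = dict
Operator = Callable[[Iterable[Row]], Iterator[Row]]
Plan = List[Tuple[str, Operator]]

TABLE = [
    {"memberId": 1, "firstName": "Ada", "title": "Engineer"},
    {"memberId": 2, "firstName": "Bob", "title": "Engineer"},
    {"memberId": 3, "firstName": "Cy", "title": "Analyst"},
]


def table_scan(_rows: Iterable[Row]) -> Iterator[Row]:
    return iter(TABLE)


def filter_op(pred: Callable[[Row], bool]) -> Operator:
    return lambda rows: (r for r in rows if pred(r))


def select_op(cols: List[str]) -> Operator:
    return lambda rows: ({c: r[c] for c in cols} for r in rows)


def gdpr_op(pii_columns: Set[str], consented: Set[int]) -> Operator:
    # Drops rows of members without consent and nulls PII columns for the rest.
    def apply(rows: Iterable[Row]) -> Iterator[Row]:
        for r in rows:
            if r["memberId"] in consented:
                yield {c: (None if c in pii_columns else v) for c, v in r.items()}
    return apply


def rewrite(plan: Plan, gdpr: Operator) -> Plan:
    out: Plan = []
    for name, op in plan:
        out.append((name, op))
        if name == "TableScan":
            out.append(("GDPR", gdpr))   # assumed position: right after the scan
    return out


def execute(plan: Plan) -> List[Row]:
    rows: Iterable[Row] = iter(())
    for _, op in plan:
        rows = op(rows)
    return list(rows)


plan: Plan = [
    ("TableScan", table_scan),
    ("Filter", filter_op(lambda r: r["title"] == "Engineer")),
    ("Select", select_op(["firstName", "title"])),
]
compliant = rewrite(plan, gdpr_op({"firstName"}, consented={1, 3}))
result = execute(compliant)
```

Placing the operator next to the scan in this sketch means downstream filters and projections never see raw PII; a real planner may make a different choice.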
  • 38. Solving for Compliant Purging With Dali + Gobblin. [Diagram: Raw Dataset, Metadata, Member Privacy Preferences, and Member Delete Requests feed the Gobblin Purger, which uses the Dali Reader Library with Use Case = Purge to produce the Purged Dataset]
  • 39. The Data Paradox: DATA DEMOCRACY <> DATA PROTECTION. More Data vs. Less Data; Discover Data vs. Discover Violations; Easy Access vs. Restricted Access. Highlighted: DATA LIFECYCLE MANAGEMENT, METADATA, DATA ACCESS LAYER.
  • 40. The Data Paradox: Solved! DATA DEMOCRACY <> DATA PROTECTION. More Data vs. Less Data; Discover Data vs. Discover Violations; Easy Access vs. Restricted Access. Highlighted: DATA LIFECYCLE MANAGEMENT, METADATA, DATA ACCESS LAYER.
  • 41. The Technology Blueprint: DATA DEMOCRACY + DATA PROTECTION. WhereHows* (METADATA), Dali (DATA ACCESS LAYER), Apache Gobblin* (DATA LIFECYCLE MANAGEMENT). * Open source: we can collaborate on these together!