Data Privacy at Scale
Anthony Hsu, Issac Buenrostro
LinkedIn
DataWorks Summit, June 20, 2018
About the speakers
Issac Buenrostro
Staff Software Engineer
LinkedIn
Apache Gobblin
Anthony Hsu
Staff Software Engineer
LinkedIn
Dali / Data access
Agenda
• Design challenges in privacy
• Building a privacy system:
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
LinkedIn's Vision
Create economic opportunity for every member of the global workforce
• 20M companies
• 15M jobs
• 50K skills
• 60K schools
• 560M members
LinkedIn's Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
Central tenets for the compliance framework
Compliance vision
• Compliance for every dataset, regardless of format, schema, or platform
• Purge records by arbitrary IDs (LinkedIn members, Lynda members, corporate seats, etc.)
• Have reasonable defaults for how datasets are purged
• Let owners customize how their dataset is purged via an easy-to-write grammar
• Detect violations, mis-tagging, required customizations, etc.
Design Challenges
Offline data scale at LinkedIn
• A dozen Hadoop clusters
• 50K+ datasets
• 15+ PB spread across clusters (3x replicated)
• 100+ TB ingested daily
• 1000s of daily Hadoop users
• 30K+ daily Hadoop flows
• 100K+ daily YARN jobs
HDFS-Specific Challenges
HDFS is append-only
• Deleting a record in the middle of a file requires rewriting the entire block
How do we efficiently update PBs of data?
• Batch deletes and process them together in a single scan through the data (a sketch follows this list)
• Leverage the Apache Gobblin framework for parallelizing work and maintaining state
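A minimal sketch of the batched, single-scan rewrite in HiveQL. The table name page_view_event, the column member_id, and the staging table ids_to_purge are all hypothetical; the real purger runs on Gobblin rather than plain Hive.

    -- Rewrite the dataset in one pass, dropping every record that matches
    -- a batched delete request, instead of editing blocks in place.
    INSERT OVERWRITE TABLE page_view_event
    SELECT e.*
    FROM page_view_event e
    LEFT JOIN ids_to_purge p ON e.member_id = p.id
    WHERE p.id IS NULL;  -- keep only records with no pending delete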
Building a privacy system
What to delete?
• Thousands of datasets
• Append-style data, mutable data, etc.
• Different fields contain purgeable entities
• Start from first principles
• Collect metadata on every single dataset on LinkedIn
Collecting Metadata
• WhereHows is a dataset catalog
• Natural place to also collect compliance metadata
• Ask data owners to specify field tags for all datasets
• Automated annotation of common fields such as event headers
Dataset Metadata (example shown as a screenshot in the slides)
Metadata Challenges
• Incomplete, incorrect, or missing metadata
• Risk of deleting wrong data
• Business/legal requirements
• How to handle non-trivial types: arrays, maps, …
• Custom types and formats
• A member ID could be 123, urn:member:123, member_123, etc.
• Composite URNs: urn:customUrn:(<memberId>,<customId>)
Which records to delete?
• Users send Kafka events with the IDs that need to be purged from LinkedIn
• Security is in place to prevent rogue requests
• Requests are stored in heavily compressed lookup tables
• Requests can be applied globally, or restricted to specific datasets, groups of datasets, or fields
• Lookup tables are available at runtime via the lookup_table('<entity>') UDF
(Diagram: a purge request flows into the lookup table store, which rules query via lookup_table('member').)
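A minimal sketch of how a rule might consult the lookup table. The dataset and column names are hypothetical, and the exact way the lookup_table UDF is addressed in a query is an assumption based on the slide.

    -- Find this dataset's rows that match an outstanding purge request
    -- for the 'member' entity.
    SELECT d.*
    FROM profile_edits d
    WHERE d.member_id IN (SELECT id FROM lookup_table('member'));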
How to delete
(Diagram: the dataset is map-joined against the 'member' lookup table, and matching rows are filtered out.)
SQL-based rules
• Most dataset owners are familiar with SQL
• Many analysts prefer it over Java
• SQL is simple yet expressive
• SQL supports custom UDFs (user-defined functions)
Row filter: select the rows that should be deleted (expressed as a WHERE clause).
Column transformation: replace field values with the result of an expression (expressed as a SELECT clause).
Sketches of both forms follow.
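Hedged sketches of the two rule forms; the table and column names are hypothetical, and addressing the lookup table through the UDF follows the earlier assumption.

    -- Row filter (WHERE clause): select the rows that should be deleted.
    SELECT *
    FROM message_events
    WHERE recipient_id IN (SELECT id FROM lookup_table('member'));

    -- Column transformation (SELECT clause): replace a field's value
    -- instead of dropping the row (null out email for purged members).
    SELECT IF(l.id IS NOT NULL, CAST(NULL AS STRING), m.email) AS email
    FROM message_events m
    LEFT JOIN (SELECT id FROM lookup_table('member')) l
      ON m.recipient_id = l.id;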
Dynamic dataset customization
• Assign SQL rules to datasets, dataset groups, or use cases
• Dynamically compose all applicable rules
• Templating UDFs (composed in the sketch below):
• default_field_value()
• row_timestamp()
• pii_fields()
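A hedged sketch of a composed rule. The expansion semantics of the templating UDFs are assumptions (for example, that row_timestamp() resolves to the dataset's timestamp column when rules are composed), and the table name is hypothetical.

    -- Select rows to purge: members with outstanding purge requests,
    -- plus anything past a two-year retention window.
    SELECT *
    FROM ad_click_event
    WHERE member_id IN (SELECT id FROM lookup_table('member'))
       OR row_timestamp() < DATE_SUB(CURRENT_DATE, 730);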
Purging a record
(Diagram: records flow from the dataset through the Hive input format and Hive table scan into the compliance operator, which applies its resolved expressions and passes the results to a record consumer.)
Compliance Operator
(Diagram: the compliance operator resolves its expressions from three inputs: SQL rules from the metadata store, IDs from the lookup table store, and the query context (dataset, user, …).)
Purging a Dataset
(Diagram: the compliance operator writes a clean dataset that replaces the original; the purged records are kept in a high-security zone under a retention policy, in case of incorrect annotations or a bug in the purge code.)
Why Apache Hive?
• SQL parser and evaluator tightly integrated with Hadoop ecosystem
• Supports multiple data formats
• Some LinkedIn tools already leverage Hive
• Dali (Data Access at LinkedIn): provides storage-agnostic access to datasets and views
Apache Gobblin
• Gobblin is a distributed data integration framework that simplifies common aspects of big data integration
• Provides the features required for an operable purger pipeline:
• Record processing
• State management
• Metric/event emission
• Data retention
• Mix-and-match of readers, the compliance operator, and writers
• …
Purging a data system
(Diagram: the purger iterates over the datasets and their partitions cataloged in the metadata store; the Gobblin state store remembers where the previous run left off, and audit events track progress.)
Auditing Compliance
• Emit audit events every time a dataset is cleaned
• Allows tracking of when a dataset was cleaned and when purge requests have been applied
• Can detect when a dataset has not been cleaned in a while
• Also emit error events containing failure details
• Notify data owners to fix dataset metadata/customizations
Read-side compliance
(Diagram: a Dali reader applies the compliance operator to the dataset before handing records to user logic in Spark, Pig, etc.)
• Dali, LinkedIn's data access layer, allows accessing datasets from any framework
• Dynamic read-side filtering
• Allows different views of the same data
• Allows a shorter SLA: immediate read-time compliance before the data has been purged
• Dynamic data obfuscation
• Queries can see the data, but no identifying information
A sketch of read-side filtering as a view follows.
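A sketch of read-side filtering expressed as a plain Hive view; this illustrates the idea rather than Dali's actual API, and the names are hypothetical.

    -- Consumers read the view instead of the raw table, so purge requests
    -- take effect at read time, before the underlying files are rewritten.
    CREATE VIEW page_view_event_compliant AS
    SELECT *
    FROM page_view_event
    WHERE member_id NOT IN (SELECT id FROM lookup_table('member'));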
Generic Data Store Purging
A three-step process (step 2 is sketched below):
• Gobblin dumps a snapshot of the data store into HDFS
• The compliance library selects the primary keys to delete or modify
• Gobblin applies the changes back to the data store
(Diagram: data flows from the data store into HDFS, and the changes flow back.)
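A sketch of step 2, with hypothetical snapshot table and key names: the compliance library computes, on HDFS, the set of primary keys to change, and Gobblin pushes those changes back to the source store.

    -- From the HDFS snapshot, select the primary keys whose rows must be
    -- deleted (or scrubbed) in the upstream data store.
    SELECT s.primary_key
    FROM datastore_snapshot s
    WHERE s.member_id IN (SELECT id FROM lookup_table('member'));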
Conclusion
(Summary diagram tying together the system's components: Gobblin, Hive, the compliance framework, WhereHows, SQL, auditing, and Dali.)
Resources
• Gobblin
• https://gobblin.apache.org/
• https://github.com/apache/incubator-gobblin
• Dali
• https://engineering.linkedin.com/teams/data/projects/dali
• WhereHows
• https://github.com/linkedin/WhereHows
Acknowledgements
• LinkedIn Teams:
• Gobblin
• Dali
• Metadata
• Trust Engineering
• Legal
• House Security
• Applications
Thank you!

Editor's Notes

  • #2 Privacy has always been central to LinkedIn in ensuring members' trust, and GDPR provided an opportunity to double down on our privacy efforts.
  • #8 Given the scale we are operating at, it's important that the compliance framework meet some basic principles. Emphasize that the system we build needs to uphold these core principles.
  • #9 To achieve our core tenets, our vision for the compliance framework is to support the following:
  • #10 Transition: in addition to these challenges, there's also the challenge of the huge scale of data at LinkedIn.
  • #15 Heterogeneous schemas and data formats. Some datasets have member IDs; some have company IDs, contract IDs, or article IDs. How do we know what to delete? You need metadata on all the datasets.
  • #16 Discuss data owners first. Automation helps alleviate redundant work, can make suggestions, and can catch errors.
  • #17 Transition: even if you catalog all your datasets and collect metadata about each one, there are still challenges.
  • #18 Need to provide closure on this slide: maybe say WhereHows provides some solutions to these problems, but they are outside the scope of this presentation. We discuss high-level solutions here and will go into detail in a subsequent presentation.
  • #27 Keep purged records in case of incorrect annotations or a bug in the purging code.
  • #34 Most data stores should be ingested into HDFS anyway for offline data-warehouse analyses. An application using a data store does not need to implement its own custom purging; purging can be done with the existing offline system, which uses the compliance library. Generate a list of keys to delete, then push the changes to the data store.
  • #36 Given the scale we are operating at, it's important that the compliance framework meet some basic principles.