Data Privacy at Scale
Anthony Hsu, Issac Buenrostro
LinkedIn
DataWorks Summit, June 20, 2018
About the speakers
Issac Buenrostro
Staff Software Engineer
LinkedIn
Apache Gobblin
Anthony Hsu
Staff Software Engineer
LinkedIn
Dali / Data access
Agenda
• Design challenges in privacy
• Building a privacy system:
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
LinkedIn's Vision
Create economic opportunity for every member of the global workforce
• 20M companies
• 15M jobs
• 50K skills
• 60K schools
• 560M members
LinkedIn's Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
Central tenets for the compliance framework
Compliance vision
• Compliance for every dataset, regardless of format, schema, or platform
• Purge records by arbitrary IDs (LinkedIn members, Lynda members, corporate seats, etc.)
• Have reasonable defaults for how datasets are purged
• Let owners customize how their dataset is purged via an easy-to-write grammar
• Detect violations, mis-tagging, required customizations, etc.
Design Challenges
Offline data scale at LinkedIn
• A dozen Hadoop clusters
• 50K+ datasets
• 15+ PB spread across clusters (3x replicated)
• 100+ TB ingested daily
• 1000s of daily Hadoop users
• 30K+ daily Hadoop flows
• 100K+ daily YARN jobs
HDFS-Specific Challenges
HDFS is append-only
• Deleting a record in the middle of a file requires rewriting the entire block
How do we efficiently update PBs of data?
• Batch deletes and process them together in a single scan through the data (a sketch follows this list)
• Leverage the Apache Gobblin framework for parallelizing work and maintaining state
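A minimal sketch of the batched, single-scan rewrite in HiveQL. The table name page_view_event, the column member_id, and the staging table ids_to_purge are all hypothetical; the real purger runs on Gobblin rather than plain Hive.

    -- Rewrite the dataset in one pass, dropping every record that matches
    -- a batched delete request, instead of editing blocks in place.
    INSERT OVERWRITE TABLE page_view_event
    SELECT e.*
    FROM page_view_event e
    LEFT JOIN ids_to_purge p ON e.member_id = p.id
    WHERE p.id IS NULL;  -- keep only records with no pending delete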
Building a privacy system
What to delete?
• Thousands of datasets
• Append-style data, mutable data, etc.
• Different fields contain purgeable entities
• Start from first principles
• Collect metadata on every single dataset on LinkedIn
Collecting Metadata
• WhereHows is a dataset catalog
• Natural place to also collect compliance metadata
• Ask data owners to specify field tags for all datasets
• Automated annotation of common fields such as event headers
Dataset Metadata (example shown as a screenshot in the slides)
Metadata Challenges
• Incomplete, incorrect, or missing metadata
• Risk of deleting wrong data
• Business/legal requirements
• How to handle non-trivial types: arrays, maps, …
• Custom types and formats
• A member ID could be 123, urn:member:123, member_123, etc.
• Composite URNs: urn:customUrn:(<memberId>,<customId>)
Which records to delete?
• Users send Kafka events with the IDs that need to be purged from LinkedIn
• Security is in place to prevent rogue requests
• Requests are stored in heavily compressed lookup tables
• Requests can be applied globally, or restricted to specific datasets, groups of datasets, or fields
• Lookup tables are available at runtime via the lookup_table('<entity>') UDF
(Diagram: a purge request flows into the lookup table store, which rules query via lookup_table('member').)
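A minimal sketch of how a rule might consult the lookup table. The dataset and column names are hypothetical, and the exact way the lookup_table UDF is addressed in a query is an assumption based on the slide.

    -- Find this dataset's rows that match an outstanding purge request
    -- for the 'member' entity.
    SELECT d.*
    FROM profile_edits d
    WHERE d.member_id IN (SELECT id FROM lookup_table('member'));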
How to delete
(Diagram: the dataset is map-joined against the 'member' lookup table, and matching rows are filtered out.)
SQL-based rules
• Most dataset owners are familiar with SQL
• Many analysts prefer it over Java
• SQL is simple yet expressive
• SQL supports custom UDFs (user-defined functions)
Row filter: select the rows that should be deleted (expressed as a WHERE clause).
Column transformation: replace field values with the result of an expression (expressed as a SELECT clause).
Sketches of both forms follow.
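Hedged sketches of the two rule forms; the table and column names are hypothetical, and addressing the lookup table through the UDF follows the earlier assumption.

    -- Row filter (WHERE clause): select the rows that should be deleted.
    SELECT *
    FROM message_events
    WHERE recipient_id IN (SELECT id FROM lookup_table('member'));

    -- Column transformation (SELECT clause): replace a field's value
    -- instead of dropping the row (null out email for purged members).
    SELECT IF(l.id IS NOT NULL, CAST(NULL AS STRING), m.email) AS email
    FROM message_events m
    LEFT JOIN (SELECT id FROM lookup_table('member')) l
      ON m.recipient_id = l.id;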
Dynamic dataset customization
• Assign SQL rules to datasets, dataset groups, or use cases
• Dynamically compose all applicable rules
• Templating UDFs (composed in the sketch below):
• default_field_value()
• row_timestamp()
• pii_fields()
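A hedged sketch of a composed rule. The expansion semantics of the templating UDFs are assumptions (for example, that row_timestamp() resolves to the dataset's timestamp column when rules are composed), and the table name is hypothetical.

    -- Select rows to purge: members with outstanding purge requests,
    -- plus anything past a two-year retention window.
    SELECT *
    FROM ad_click_event
    WHERE member_id IN (SELECT id FROM lookup_table('member'))
       OR row_timestamp() < DATE_SUB(CURRENT_DATE, 730);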
Purging a record
(Diagram: records flow from the dataset through the Hive input format and Hive table scan into the compliance operator, which applies its resolved expressions and passes the results to a record consumer.)
Compliance Operator
(Diagram: the compliance operator resolves its expressions from three inputs: SQL rules from the metadata store, IDs from the lookup table store, and the query context (dataset, user, …).)
Purging a Dataset
(Diagram: the compliance operator writes a clean dataset that replaces the original; the purged records are kept in a high-security zone under a retention policy, in case of incorrect annotations or a bug in the purge code.)
Why Apache Hive?
• SQL parser and evaluator tightly integrated with Hadoop ecosystem
• Supports multiple data formats
• Some LinkedIn tools already leverage Hive
• Dali (Data Access at LinkedIn): provides storage-agnostic access to datasets and views
Apache Gobblin
• Gobblin is a distributed data integration framework that simplifies common aspects of big data integration
• Provides the features required for an operable purger pipeline:
• Record processing
• State management
• Metric/event emission
• Data retention
• Mix-and-match of readers, the compliance operator, and writers
• …
Purging a data system
(Diagram: the purger iterates over the datasets and their partitions cataloged in the metadata store; the Gobblin state store remembers where the previous run left off, and audit events track progress.)
Auditing Compliance
• Emit audit events every time a dataset is cleaned
• Allows tracking of when a dataset was cleaned and when purge requests have been applied
• Can detect when a dataset has not been cleaned in a while
• Also emit error events containing failure details
• Notify data owners to fix dataset metadata/customizations
Read-side compliance
(Diagram: a Dali reader applies the compliance operator to the dataset before handing records to user logic in Spark, Pig, etc.)
• Dali, LinkedIn's data access layer, allows accessing datasets from any framework
• Dynamic read-side filtering
• Allows different views of the same data
• Allows a shorter SLA: immediate read-time compliance before the data has been purged
• Dynamic data obfuscation
• Queries can see the data, but no identifying information
A sketch of read-side filtering as a view follows.
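A sketch of read-side filtering expressed as a plain Hive view; this illustrates the idea rather than Dali's actual API, and the names are hypothetical.

    -- Consumers read the view instead of the raw table, so purge requests
    -- take effect at read time, before the underlying files are rewritten.
    CREATE VIEW page_view_event_compliant AS
    SELECT *
    FROM page_view_event
    WHERE member_id NOT IN (SELECT id FROM lookup_table('member'));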
Generic Data Store Purging
A three-step process (step 2 is sketched below):
• Gobblin dumps a snapshot of the data store into HDFS
• The compliance library selects the primary keys to delete or modify
• Gobblin applies the changes back to the data store
(Diagram: data flows from the data store into HDFS, and the changes flow back.)
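A sketch of step 2, with hypothetical snapshot table and key names: the compliance library computes, on HDFS, the set of primary keys to change, and Gobblin pushes those changes back to the source store.

    -- From the HDFS snapshot, select the primary keys whose rows must be
    -- deleted (or scrubbed) in the upstream data store.
    SELECT s.primary_key
    FROM datastore_snapshot s
    WHERE s.member_id IN (SELECT id FROM lookup_table('member'));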
Conclusion
(Summary diagram tying together the system's components: Gobblin, Hive, the compliance framework, WhereHows, SQL, auditing, and Dali.)
Resources
• Gobblin
• https://gobblin.apache.org/
• https://github.com/apache/incubator-gobblin
• Dali
• https://engineering.linkedin.com/teams/data/projects/dali
• WhereHows
• https://github.com/linkedin/WhereHows
Acknowledgements
• LinkedIn Teams:
• Gobblin
• Dali
• Metadata
• Trust Engineering
• Legal
• House Security
• Applications
Thank you!

Editor's Notes

  • #2 Privacy has always been central to LinkedIn in ensuring members' trust, and GDPR provided an opportunity to double down on our privacy efforts.
  • #8 Given the scale we are operating at, it's important that the compliance framework meet some basic principles. Emphasize that the system we build needs to uphold these core principles.
  • #9 To achieve our core tenets, our vision for the compliance framework is to support the following:
  • #10 Transition: in addition to these challenges, there's also the challenge of the huge scale of data at LinkedIn.
  • #15 Heterogeneous schemas and data formats. Some datasets have member IDs; some have company IDs, contract IDs, or article IDs. How do we know what to delete? You need metadata on all the datasets.
  • #16 Discuss data owners first. Automation helps alleviate redundant work, can make suggestions, and can catch errors.
  • #17 Transition: even if you catalog all your datasets and collect metadata about each one, there are still challenges.
  • #18 Need to provide closure on this slide: maybe say WhereHows provides some solutions to these problems, but they are outside the scope of this presentation. We discuss high-level solutions here and will go into detail in a subsequent presentation.
  • #27 Keep purged records in case of incorrect annotations or a bug in the purging code.
  • #34 Most data stores should be ingested into HDFS anyway for offline data-warehouse analyses. An application using a data store does not need to implement its own custom purging; purging can be done with the existing offline system, which uses the compliance library. Generate a list of keys to delete, then push the changes to the data store.
  • #36 Given the scale we are operating at, it's important that the compliance framework meet some basic principles.