Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges of enforcing compliance at scale and LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features that let users define custom filtering rules in SQL and assign those customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects, including Hadoop, Hive, Gobblin, and WhereHows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used to enforce compliance on other stores such as Pinot, Salesforce, and Espresso.
While there is no one-size-fits-all solution to guaranteeing user data privacy, this talk will provide a blueprint and concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments.
ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn, and ANTHONY HSU, Staff Software Engineer, LinkedIn
2. About the speakers
Issac Buenrostro
Staff Software Engineer
LinkedIn
Apache Gobblin
Anthony Hsu
Staff Software Engineer
LinkedIn
Dali / Data access
3. Agenda
• Design challenges in privacy
• Building a privacy system:
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
4. • Design challenges in privacy
• Building a privacy system:
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
5. LinkedIn's Vision
Create economic opportunity for every member of the global workforce
20M COMPANIES · 15M JOBS · 50K SKILLS · 60K SCHOOLS · 560M MEMBERS
6. LinkedIn's Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
8. Compliance vision
• Compliance of every dataset regardless of format, schema, platform.
• Purge records by arbitrary IDs (LinkedIn members, Lynda members, corporate seats, etc.)
• Have reasonable defaults for how datasets are purged
• Let owners customize how their dataset is purged via an easy-to-write grammar
• Detect violations, mis-tagging, required customizations, etc.
10. Offline data scale at LinkedIn
• A dozen Hadoop clusters
• 50k+ datasets
• 15+ PB spread across clusters (3x replicated)
• 100+ TB ingested daily
• 1000s of daily Hadoop users
• 30K+ daily Hadoop flows
• 100K+ daily YARN jobs
11. HDFS Specific Challenges
HDFS is append-only
• Deleting a record in the middle of a file requires rewriting the entire block
How do we efficiently update PBs of data?
• Batch deletes and process them together in a single scan through the data (see the sketch below)
• Leverage the Apache Gobblin framework for parallelizing work and maintaining state
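A minimal Hive-style sketch of this batch-rewrite idea, assuming hypothetical table names (page_views, purge_requests) and a member_id column; LinkedIn's actual pipeline runs through Gobblin rather than handwritten queries:

```sql
-- Apply a whole batch of purge requests in one scan: copy every surviving
-- record into a staging table, then swap it in for the original.
-- All table and column names here are illustrative assumptions.
INSERT OVERWRITE TABLE page_views_cleaned
SELECT pv.*
FROM page_views pv
WHERE pv.member_id NOT IN (SELECT member_id FROM purge_requests);
```

Because every queued request is applied in the same pass, the cost of rewriting the data is paid once per batch rather than once per request.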
13. • Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
14. What to delete?
• Thousands of datasets
• Append-style data, mutable data, etc.
• Different fields contain purgeable entities
• Start from first principles
• Collect metadata on every single dataset on LinkedIn
15. Collecting Metadata
• WhereHows is a dataset catalog
• Natural place to also collect compliance metadata
• Ask data owners to specify field tags for all datasets
• Automated annotation of common fields such as event headers
17. Metadata Challenges
• Incomplete, incorrect, missing metadata
• Risk of deleting wrong data
• Business/legal requirements
• How to handle non-trivial types: arrays, maps, …
• Custom types and formats
• Member ID could be 123, urn:member:123, member_123, etc. (normalization sketch below)
• Composite URNs: urn:customUrn:(<memberId>,<customId>)
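As a hypothetical illustration of the format problem, trailing-digit extraction handles the simple encodings on this slide, though composite URNs would need format-specific parsing; the table name and pattern here are assumptions:

```sql
-- Normalize '123', 'urn:member:123', and 'member_123' to a canonical
-- numeric id before matching against purge requests. Composite URNs like
-- urn:customUrn:(<memberId>,<customId>) require their own parsing rules.
SELECT regexp_extract(raw_id, '([0-9]+)$', 1) AS member_id
FROM raw_events;  -- illustrative table name
```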
18. Which records to delete?
• Users send Kafka events with the IDs that need to be purged from LinkedIn
• Security in place to prevent rogue requests.
• Requests stored in heavily compressed lookup tables
• Requests can be applied globally, or restricted to specific datasets, groups of datasets, or fields.
• Lookup tables are available at runtime via the lookup_table('<entity>') UDF (see the sketch below).
[Diagram: purge request → lookup table store → lookup_table('member') UDF]
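A minimal sketch of how a purge rule might consult the lookup table at runtime. The lookup_table UDF name comes from the slide; its return type and the dataset schema are assumptions:

```sql
-- Row filter: select the rows that must be purged for the 'member' entity.
-- Assumes lookup_table('member') returns the purged ids as an array.
SELECT *
FROM member_activity
WHERE array_contains(lookup_table('member'), member_id);
```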
19. • Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
21. SQL-based rules
• Most dataset owners are familiar with SQL
• Preferable for many analysts over Java
• SQL is simple yet expressive
• SQL supports custom UDFs (user-defined functions)
Row Filter: select rows that should be deleted (a WHERE clause)
Column Transformations: replace field values with the result of an expression (a SELECT clause)
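Hypothetical examples of each rule type, using assumed dataset and column names:

```sql
-- Row filter (WHERE clause): rows matching the predicate are deleted.
SELECT *
FROM connection_events
WHERE array_contains(lookup_table('member'), actor_id);

-- Column transformation (SELECT clause): keep the row but replace the
-- identifying field with the result of an expression.
SELECT CAST(NULL AS BIGINT) AS actor_id,
       event_type,
       event_time
FROM connection_events;
```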
22. Dynamic dataset customization
• Assign SQL rules to datasets, dataset groups, or use cases
• Dynamically compose all applicable rules
• Templating UDFs:
• default_field_value()
• row_timestamp()
• pii_fields()
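The slide names these templating UDFs without showing their expansion; the fragments below assume row_timestamp() expands to the dataset's timestamp column (epoch seconds) and that pii_fields() and default_field_value() expand against the schema metadata collected in WhereHows:

```sql
-- Row-filter fragment: purge any record older than roughly two years.
row_timestamp() < unix_timestamp() - 2 * 365 * 86400

-- Column-transformation fragment: reset each PII-tagged field (as listed
-- by pii_fields()) to a type-appropriate default value.
default_field_value()
```

Because the templating UDFs resolve per dataset at composition time, one rule written this way can be assigned to a whole group of datasets.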
23. • Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
24. Purging a record
[Diagram: Dataset → Hive Input Format → Hive Table Scan → Compliance Operator (applies resolved expressions) → Record Consumer]
27. Why Apache Hive?
• SQL parser and evaluator tightly integrated with Hadoop ecosystem
• Supports multiple data formats
• Some LinkedIn tools already leverage Hive
• Dali (Data Access at LinkedIn): Provides storage agnostic access to datasets and views
28. Apache Gobblin
• Gobblin is a distributed data integration framework that simplifies common aspects of big data integration.
• Provides features required for operable purger pipeline:
• Record processing
• State management
• Metric/event emission
• Data retention
• Mix-and-match of readers, compliance operator, and writers.
• …
29. Purging a data system
[Diagram: a Metadata Store lists datasets and their parts (Dataset 1: Parts 1-3, Dataset 2: Parts 1-2, Dataset 3, Dataset 4, …); the Gobblin State Store remembers where we left off; audit events are emitted]
30. Auditing Compliance
• Emit audit events every time a dataset is cleaned
• Allows tracking of when a dataset was cleaned and when purge requests have been applied
• Can detect when a dataset has not been cleaned in a while (see the query sketch below)
• Also emit error events containing failure details
• Notify data owners to fix dataset metadata/customizations.
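A hypothetical freshness check over the audit events; the table name and schema are assumptions:

```sql
-- Flag datasets that have not emitted a 'cleaned' audit event in 30 days.
SELECT dataset_name,
       MAX(cleaned_at) AS last_cleaned
FROM compliance_audit_events
GROUP BY dataset_name
HAVING MAX(cleaned_at) < date_sub(current_date, 30);
```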
31. • Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
32. Read-side compliance
[Diagram: Dataset → Dali Reader with Compliance Operator → user logic (Spark, Pig, ...)]
• Dali – LinkedIn's data access layer, allows accessing datasets from any framework
• Dynamic read-side filtering
• Allows different views of the same data
• Allows for a shorter SLA: immediate read-time compliance before data has been purged
• Dynamic data obfuscation
• Queries can see the data, but no identifying information
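One way such an obfuscating view might look, with assumed table and column names and an assumed hashing choice:

```sql
-- Read-time view: consumers can still join and aggregate on a stable
-- pseudonym, but never see the raw member id. Purged members are filtered
-- out immediately, before the physical purge catches up.
CREATE VIEW page_views_obfuscated AS
SELECT sha2(CAST(member_id AS STRING), 256) AS member_token,
       page_key,
       view_time
FROM page_views
WHERE NOT array_contains(lookup_table('member'), member_id);
```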
33. Generic Data Store Purging
Three-step process:
• Gobblin dumps a snapshot into HDFS
• The compliance library selects primary keys to delete/modify (sketch below)
• Gobblin applies the changes back to the data store
[Diagram: data store ⇄ HDFS]
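A sketch of the key-selection step, assuming a snapshot table and column names for illustration:

```sql
-- Step 2: scan the HDFS snapshot of the online store and emit the primary
-- keys whose rows must be deleted or modified in the source system.
SELECT s.primary_key
FROM espresso_snapshot s
WHERE array_contains(lookup_table('member'), s.member_id);
```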
34. • Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
Speaker notes:
Privacy has always been central to LinkedIn in ensuring members' trust, and GDPR provided an opportunity to double down on our privacy efforts.
Given the scale we are operating at, it's important that the compliance framework meet some basic principles.
Emphasize that the system we build needs to uphold these core principles.
To achieve our core tenets, our vision for our compliance framework is to support the following:
Transition: In addition to these challenges, there's also the challenge of the huge scale of data at LinkedIn.
Heterogeneous schemas and data formats
Some datasets have member ids, some have company ids or contract ids or article ids
How do we know what to delete? You need metadata on all the datasets.
Discuss data owners first
Automation helps alleviate redundant work; it can make suggestions and catch errors.
Transition: Even if you catalog all your datasets and collect metadata about each dataset, there are still challenges.
Need to provide closure on this slide: maybe say WhereHows provides some solutions to these problems but outside scope of presentation
We discuss high-level solutions in this presentation; will go into detail in subsequent presentation.
* Keep purged records in case of incorrect annotations or bugs in the purging code.
Most data stores should be ingested into HDFS anyway for offline data warehouse analyses. An application using a data store does not need to implement its own custom purging: purging can be done with the existing offline system, which uses the compliance library. Generate a list of keys to delete, then push the changes to the data store.