Data Privacy at Scale
Anthony Hsu, Issac Buenrostro
LinkedIn
DataWorks Summit, June 20, 2018
About the speakers
Issac Buenrostro
Staff Software Engineer
LinkedIn
Apache Gobblin
Anthony Hsu
Staff Software Engineer
LinkedIn
Dali / Data access
Agenda
• Design challenges in privacy
• Building a privacy system:
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
LinkedIn's Vision
Create economic opportunity for every member of the global workforce
• 20M companies
• 15M jobs
• 50K skills
• 60K schools
• 560M members
LinkedIn's Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
Central tenets for compliance framework
Compliance vision
• Compliance of every dataset regardless of format, schema, or platform.
• Purge records by arbitrary IDs (LinkedIn members, Lynda members, corporate seats, etc.)
• Have reasonable defaults for how datasets are purged
• Let owners customize how their dataset is purged via an easy-to-write grammar
• Detect violations, mis-tagging, required customizations, etc.
Design Challenges
Offline data scale at LinkedIn
• A dozen Hadoop clusters
• 50k+ datasets
• 15+ PB spread across clusters (3x replicated)
• 100+ TB ingested daily
• 1000s of daily Hadoop users
• 30K+ daily Hadoop flows
• 100K+ daily YARN jobs
HDFS Specific Challenges
HDFS is append only
• Deleting a record in the middle of a file requires rewriting the
entire block
How do we efficiently update PBs of data?
• Batch deletes and process them together in a single scan
through the data
• Leverage the Apache Gobblin framework for parallelizing work
and maintaining state
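The batching idea above can be sketched in a few lines. This is only an illustration, not the Gobblin-based implementation from the talk: the record layout, the `member_id` field name, and the function `purge_in_one_scan` are assumptions made for the example.

```python
from typing import Iterable, Iterator


def purge_in_one_scan(records: Iterable[dict], purge_ids: set) -> Iterator[dict]:
    """Yield only the records whose member_id is not in the batched purge set."""
    for record in records:
        if record["member_id"] not in purge_ids:
            yield record


# Requests accumulated since the last run are merged into one set ...
batched_requests = [{7}, {42, 99}, {7, 13}]
purge_ids = set().union(*batched_requests)

# ... so the dataset is read and rewritten once, not once per request.
dataset = [{"member_id": m, "event": "page_view"} for m in (1, 7, 13, 20, 42)]
cleaned = list(purge_in_one_scan(dataset, purge_ids))
print([r["member_id"] for r in cleaned])
```

The cost of the scan is paid once per run regardless of how many requests were batched, which is what makes rewriting append-only files tolerable at this scale.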
Building a privacy system
• Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
What to delete?
• Thousands of datasets
• Append style data, mutable data, etc.
• Different fields contain purgeable entities
• Start from first principles
• Collect metadata on every single dataset on LinkedIn
Collecting Metadata
• WhereHows is a dataset catalog
• Natural place to also collect compliance metadata
• Ask data owners to specify field tags for all datasets
• Automated annotation of common fields such as event
headers
Dataset Metadata
Metadata Challenges
• Incomplete, incorrect, missing metadata
• Risk of deleting wrong data
• Business/legal requirements
• How to handle non-trivial types: arrays, maps, …
• Custom types and formats
• Member id could be 123, urn:member:123, member_123, etc.
• Composite URNs: urn:customUrn:(<memberId>,<customId>)
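To make the custom-format problem concrete, here is a minimal sketch of normalizing the ID formats listed above to a plain integer. The regular expressions and the function name `extract_member_id` are hypothetical, and the sketch assumes the member ID is the first element of a composite URN, as in the pattern shown on the slide.

```python
import re
from typing import Optional

# Hypothetical patterns covering the formats listed above.
_SIMPLE = re.compile(r"^(?:urn:member:|member_)?(\d+)$")
_COMPOSITE = re.compile(r"^urn:[A-Za-z]+:\((\d+),[^()]*\)$")


def extract_member_id(value) -> Optional[int]:
    """Return the member ID embedded in value, or None if the format is unknown."""
    text = str(value).strip()
    match = _SIMPLE.match(text) or _COMPOSITE.match(text)
    return int(match.group(1)) if match else None


for raw in (123, "urn:member:123", "member_123", "urn:customUrn:(123,456)", "n/a"):
    print(repr(raw), "->", extract_member_id(raw))
```

Returning `None` for an unrecognized format, rather than guessing, keeps the risk of deleting the wrong data visible: such values can be reported to the dataset owner instead of being silently matched.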
Which records to delete?
• Users send Kafka events with the IDs that need to be purged from LinkedIn
• Security in place to prevent rogue requests.
• Requests stored in heavily compressed lookup tables
• Requests can be applied globally, or restricted to specific datasets, groups
of datasets, or fields.
• Lookup tables available at runtime via lookup_table('<entity>') UDF.
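A rough sketch of how scoped requests might be resolved into a per-entity lookup table follows. The classes `PurgeRequest` and `LookupTableStore` are hypothetical, only dataset-level scoping is modeled, and the dataset is passed explicitly here, whereas the slide's `lookup_table('<entity>')` UDF takes only the entity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PurgeRequest:
    entity: str                     # e.g. "member"
    entity_id: int
    dataset: Optional[str] = None   # None means the request applies globally


class LookupTableStore:
    """In-memory stand-in for the compressed lookup tables."""

    def __init__(self) -> None:
        self._requests: List[PurgeRequest] = []

    def add(self, request: PurgeRequest) -> None:
        self._requests.append(request)

    def lookup_table(self, entity: str, dataset: str) -> frozenset:
        """IDs of the given entity type that must be purged from this dataset."""
        return frozenset(
            r.entity_id
            for r in self._requests
            if r.entity == entity and r.dataset in (None, dataset)
        )


store = LookupTableStore()
store.add(PurgeRequest("member", 1))                        # global request
store.add(PurgeRequest("member", 2, dataset="page_views"))  # restricted request
print(sorted(store.lookup_table("member", "page_views")))
print(sorted(store.lookup_table("member", "profiles")))
```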
[Diagram: Purge Request → Lookup Table Store → lookup_table('member')]
• Design challenges in privacy
• Building a privacy system
• What to delete?
• How to delete?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
How to delete
[Diagram: Dataset and 'member' Lookup Table → Map-Join → Filter]
SQL-based rules
• Most dataset owners are familiar with SQL
• Preferable for many analysts over Java
• SQL is simple yet expressive
• SQL supports custom UDFs (user-defined functions)
• Row filter: select rows that should be deleted (WHERE clause)
• Column transformations: replace field values by the result of an expression (SELECT clause)
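The two rule types can be illustrated with a small, self-contained example. The talk uses Apache Hive; this sketch swaps in SQLite (via Python's standard `sqlite3` module) purely so it runs anywhere. The table names, the column names, and the way the two rules are combined into one query are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (member_id INTEGER, viewer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 7, "home"), (7, 2, "jobs"), (3, 4, "feed")],
)
# Stand-in for the 'member' lookup table of purge requests.
conn.execute("CREATE TABLE purge_member (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO purge_member VALUES (7)")

# Row filter: the WHERE clause names the rows to delete, so the cleaned
# dataset keeps every row that does NOT match it.
row_filter = "member_id IN (SELECT id FROM purge_member)"

# Column transformation: the SELECT clause replaces a field value with the
# result of an expression, here nulling out a purged viewer_id.
column_transform = (
    "CASE WHEN viewer_id IN (SELECT id FROM purge_member) "
    "THEN NULL ELSE viewer_id END"
)

cleaned = conn.execute(
    f"SELECT member_id, {column_transform} AS viewer_id, page "
    f"FROM events WHERE NOT ({row_filter}) ORDER BY member_id"
).fetchall()
print(cleaned)
```

The row owned by member 7 is dropped entirely, while the row that merely references member 7 as a viewer survives with that field cleared.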
Dynamic dataset customization
• Assign SQL rules to datasets, dataset groups, or use cases
• Dynamically compose all applicable rules
• Templating UDFs:
• default_field_value()
• row_timestamp()
• pii_fields()
• Design challenges in privacy
• Building a privacy system
• What to delete?
• How to delete?
• How to purge a large data system?
• Beyond HDFS
• Conclusion
Purging a record
[Diagram: Dataset → Hive Table Scan (Hive InputFormat) → Compliance Operator → Record Consumer. The Compliance Operator uses resolved expressions built from the Lookup Table Store, the Metadata Store, the SQL rules, and the query context (dataset, user, …).]
Purging a Dataset
[Diagram: Dataset → Compliance Operator → Clean Dataset, which replaces the original. Purged records go to a High Security Zone, subject to retention.]
Why Apache Hive?
• SQL parser and evaluator tightly integrated with Hadoop ecosystem
• Supports multiple data formats
• Some LinkedIn tools already leverage Hive
• Dali (Data Access at LinkedIn): Provides storage agnostic access to datasets and views
Apache Gobblin
• Gobblin is a distributed data integration framework that simplifies common
aspects of big data integration.
• Provides features required for operable purger pipeline:
• Record processing
• State management
• Metric/event emission
• Data retention
• Mix-and-match of readers, compliance operator, and writers.
• …
Purging a data system
[Diagram: the Metadata Store lists the datasets and their partitions (Dataset 1: Parts 1 to 3; Dataset 2: Parts 1 and 2; Dataset 3; Dataset 4; …). The Gobblin State Store remembers where we left off. Each run feeds Audit.]
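The "remember where we left off" idea can be sketched as a checkpointed loop over partitions. This is not Gobblin's API: the plain dictionary `state` stands in for the Gobblin state store, and `run_purge`, `clean`, and `budget` are hypothetical names for the example.

```python
from typing import Callable, Dict, List


def run_purge(
    partitions: Dict[str, List[str]],
    state: Dict[str, str],
    clean: Callable[[str, str], None],
    budget: int,
) -> int:
    """Clean up to `budget` partitions, resuming after the last recorded one."""
    done = 0
    for dataset, parts in partitions.items():
        # Resume just past the last partition recorded for this dataset.
        start = parts.index(state[dataset]) + 1 if dataset in state else 0
        for part in parts[start:]:
            if done == budget:
                return done
            clean(dataset, part)
            state[dataset] = part  # record progress after each partition
            done += 1
    return done


partitions = {
    "dataset1": ["part1", "part2", "part3"],
    "dataset2": ["part1", "part2"],
}
state: Dict[str, str] = {}
log: List[tuple] = []

first = run_purge(partitions, state, lambda d, p: log.append((d, p)), budget=2)
second = run_purge(partitions, state, lambda d, p: log.append((d, p)), budget=10)
print(first, second, log)
```

Because progress is recorded per partition, a run that is interrupted or exhausts its budget does not repeat finished work on the next run.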
Auditing Compliance
• Emit audit events every time a dataset is cleaned
• Allows tracking of when a dataset was cleaned and when purge requests have been
applied
• Can detect when a dataset has not been cleaned in a while
• Also emit error events containing failure details
• Notify data owners to fix dataset metadata/customizations.
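As an illustration of how audit events could be used to detect datasets that have not been cleaned in a while, consider the sketch below. The event fields (`dataset`, `time`, `status`) and the function `stale_datasets` are assumptions, not the actual event schema.

```python
from datetime import datetime, timedelta


def stale_datasets(all_datasets, audit_events, now, max_age):
    """Return datasets with no successful cleaning within max_age of now."""
    last_cleaned = {}
    for event in audit_events:
        if event["status"] == "SUCCESS":
            previous = last_cleaned.get(event["dataset"])
            if previous is None or event["time"] > previous:
                last_cleaned[event["dataset"]] = event["time"]
    return sorted(
        d for d in all_datasets
        if d not in last_cleaned or now - last_cleaned[d] > max_age
    )


now = datetime(2018, 6, 20)
events = [
    {"dataset": "page_views", "time": datetime(2018, 6, 19), "status": "SUCCESS"},
    {"dataset": "profiles", "time": datetime(2018, 5, 1), "status": "SUCCESS"},
    {"dataset": "messages", "time": datetime(2018, 6, 19), "status": "FAILURE"},
]
stale = stale_datasets(
    ["page_views", "profiles", "messages"], events, now, timedelta(days=30)
)
print(stale)
```

A dataset whose only recent events are failures counts as stale here, which matches the intent of notifying owners to fix metadata or customizations.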
• Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a very large data ecosystem?
• Beyond HDFS
• Conclusion
Read-side compliance
[Diagram: Dataset → Compliance Operator → Dali Reader → User Logic (Spark, Pig, ...)]
• Dali – LinkedIn's data access layer, allows accessing datasets from any
framework
• Dynamic read-side filtering
• Allows different views of the same data
• Allows for shorter SLA: immediate read-time compliance before data has been purged
• Dynamic data obfuscation
• Queries can see the data, but no identifying information
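A minimal sketch of read-side filtering and obfuscation follows. It is not Dali's API: `read_compliant` is a hypothetical wrapper, and masking fields with `None` is just one simple stand-in for obfuscation.

```python
def read_compliant(records, purged_ids, pii_fields=(), obfuscate=False):
    """Yield records as a compliant reader would expose them to user logic."""
    for record in records:
        if record["member_id"] in purged_ids:
            continue  # hidden at read time, even if not yet purged on disk
        if obfuscate:
            # Keep the row but blank out identifying fields.
            record = {k: (None if k in pii_fields else v) for k, v in record.items()}
        yield record


rows = [
    {"member_id": 1, "email": "a@example.com", "page": "home"},
    {"member_id": 7, "email": "b@example.com", "page": "jobs"},
]
filtered = list(read_compliant(rows, purged_ids={7}))
masked = list(
    read_compliant(rows, purged_ids={7}, pii_fields={"member_id", "email"}, obfuscate=True)
)
print(filtered)
print(masked)
```

Since the filter runs on every read, a purge request takes effect for readers as soon as it reaches the lookup table, before the underlying files are rewritten.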
Generic Data Store Purging
Three-step process:
• Gobblin dumps snapshot into HDFS
• Compliance library selects primary keys to delete/modify
• Gobblin applies changes to data store
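The three steps can be sketched with a dictionary standing in for the data store. All function names here are hypothetical; in the talk the dump and the write-back are performed by Gobblin and the key selection by the compliance library. Only deletion is shown, not modification.

```python
def dump_snapshot(store):
    """Step 1: copy the store's rows out (stand-in for a dump into HDFS)."""
    return [{"key": key, **row} for key, row in store.items()]


def select_keys(snapshot, purge_ids):
    """Step 2: choose the primary keys whose rows must be deleted."""
    return [row["key"] for row in snapshot if row["member_id"] in purge_ids]


def apply_deletes(store, keys):
    """Step 3: apply the changes back to the live data store."""
    for key in keys:
        store.pop(key, None)


store = {
    "k1": {"member_id": 1},
    "k2": {"member_id": 7},
    "k3": {"member_id": 7},
}
keys = select_keys(dump_snapshot(store), purge_ids={7})
apply_deletes(store, keys)
print(keys, store)
```

Working from a snapshot means the expensive matching runs offline, and the live store only receives a list of keys to change.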
[Diagram: Data Store ↔ HDFS]
• Design challenges in privacy
• Building a privacy system
• What to delete?
• How to clean a dataset?
• How to purge a very large data ecosystem?
• Beyond HDFS
• Conclusion
[Diagram: Compliance Framework built from Gobblin, Hive, WhereHows, SQL, Audit, and Dali]
Resources
• Gobblin
• https://gobblin.apache.org/
• https://github.com/apache/incubator-gobblin
• Dali
• https://engineering.linkedin.com/teams/data/projects/dali
• WhereHows
• https://github.com/linkedin/WhereHows
Acknowledgements
• LinkedIn Teams:
• Gobblin
• Dali
• Metadata
• Trust Engineering
• Legal
• House Security
• Applications
Thank you!

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Data Privacy at Scale

  • 1. Data Privacy at Scale Anthony Hsu, Issac Buenrostro LinkedIn DataWorks Summit, June 20, 2018
  • 2. About the speakers Issac Buenrostro Staff Software Engineer LinkedIn Apache Gobblin Anthony Hsu Staff Software Engineer LinkedIn Dali / Data access
  • 3. Agenda • Design challenges in privacy • Building a privacy system: • What to delete? • How to clean a dataset? • How to purge a large data system? • Beyond HDFS • Conclusion
  • 4. • Design challenges in privacy • Building a privacy system: • What to delete? • How to clean a dataset? • How to purge a large data system? • Beyond HDFS • Conclusion
  • 5. LinkedIn's Vision Create economic opportunity for every member of the global workforce 20M COMPANIES 15M JOBS 50K SKILLS 60K SCHOOLS 560M MEMBERS
  • 6. LinkedIn's Privacy Paradox “On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!” Kalinda Raina, Head of Global Privacy, LinkedIn
  • 7. Central tenets for compliance framework
  • 8. Compliance vision • Compliance of every dataset regardless of format, schema, platform. • Purge records by arbitrary IDs (LinkedIn members, Lynda members, corporate seats, etc.) • Have reasonable defaults for how datasets are purged • Let owners customize how their dataset is purged via easy-to-write grammar • Detect violations, mis-tagging, required customizations, etc.
  • 10. Offline data scale at LinkedIn • A dozen Hadoop clusters • 50k+ datasets • 15+ PB spread across clusters (3x replicated) • 100+ TB ingested daily • 1000s of daily Hadoop users • 30K+ daily Hadoop flows • 100K+ daily YARN jobs
  • 11. HDFS Specific Challenges HDFS is append only • Deleting a record in the middle of a file requires rewriting the entire block How do we efficiently update PBs of data? • Batch deletes and process them together in a single scan through the data • Leverage the Apache Gobblin framework for parallelizing work and maintaining state
  • 13. • Design challenges in privacy • Building a privacy system • What to delete? • How to clean a dataset? • How to purge a large data system? • Beyond HDFS • Conclusion
  • 14. What to delete? • Thousands of datasets • Append style data, mutable data, etc. • Different fields contain purgeable entities • Start from first principles • Collect metadata on every single dataset on LinkedIn
  • 15. Collecting Metadata • WhereHows is a dataset catalog • Natural place to also collect compliance metadata • Ask data owners to specify field tags for all datasets • Automated annotation of common fields such as event headers
  • 17. Metadata Challenges • Incomplete, incorrect, missing metadata • Risk of deleting wrong data • Business/legal requirements • How to handle non-trivial types: arrays, maps, … • Custom types and formats • Member id could be 123, urn:member:123, member_123, etc. • Composite URNs: urn:customUrn:(<memberId>,<customId>)
  • 18. Which records to delete? • Users send Kafka events with the IDs that need to be purged from LinkedIn • Security in place to prevent rogue requests. • Requests stored in heavily compressed lookup tables • Requests can be applied globally, or restricted to specific datasets, groups of datasets, or fields. • Lookup tables available at runtime via lookup_table('<entity>') UDF. Purge Request Lookup Table Store lookup_table('member')
  • 19. • Design challenges in privacy • Building a privacy system • What to delete? • How to delete? • How to purge a large data system? • Beyond HDFS • Conclusion
  • 21. SQL-based rules • Most dataset owners are familiar with SQL • Preferable for many analysts over Java • SQL is simple yet expressive • SQL supports custom UDFs (user-defined functions) • Row Filter: select rows that should be deleted (WHERE clause) • Column Transformations: replace field values by result of expression (SELECT clause)
  • 22. Dynamic dataset customization • Assign SQL rules to datasets, dataset groups, or use cases • Dynamically compose all applicable rules • Templating UDFs: • default_field_value() • row_timestamp() • pii_fields()
  • 23. • Design challenges in privacy • Building a privacy system • What to delete? • How to delete? • How to purge a large data system? • Beyond HDFS • Conclusion
  • 24. Purging a record Compliance Operator Resolved Expressions Hive Input Format Hive Table Scan Dataset Record Consumer
  • 26. Purging a Dataset Dataset Clean Dataset Compliance Operator Replace High Security Zone Purged Records Retention
  • 27. Why Apache Hive? • SQL parser and evaluator tightly integrated with Hadoop ecosystem • Supports multiple data formats • Some LinkedIn tools already leverage Hive • Dali (Data Access at LinkedIn): Provides storage agnostic access to datasets and views
  • 28. Apache Gobblin • Gobblin is a distributed data integration framework that simplifies common aspects of big data integration. • Provides features required for operable purger pipeline: • Record processing • State management • Metric/event emission • Data retention • Mix-and-match of readers, compliance operator, and writers. • …
  • 29. Purging a data system Metadata Store Dataset 1 - Part 1 - Part 2 - Part 3 Dataset 2 - Part 1 - Part 2 Dataset 3 Dataset 4 … Gobblin State Store Remember where we left off Audit
  • 30. Auditing Compliance • Emit audit events every time a dataset is cleaned • Allows tracking of when a dataset was cleaned and when purge requests have been applied • Can detect when a dataset has not been cleaned in a while • Also emit error events containing failure details • Notify data owners to fix dataset metadata/customizations.
  • 31. • Design challenges in privacy • Building a privacy system • What to delete? • How to clean a dataset? • How to purge a very large data ecosystem? • Beyond HDFS • Conclusion
  • 32. Read-side compliance Dataset Compliance Operator Dali Reader User Logic (Spark, Pig, ...) • Dali – LinkedIn's data access layer, allows accessing datasets from any framework • Dynamic read-side filtering • Allows different views of the same data • Allows for shorter SLA: immediate read-time compliance before data has been purged • Dynamic data obfuscation • Queries can see the data, but no identifying information
  • 33. Generic Data Store Purging Three-step process: • Gobblin dumps snapshot into HDFS • Compliance library selects primary keys to delete/modify • Gobblin applies changes to data store Data Store HDFS
  • 34. • Design challenges in privacy • Building a privacy system • What to delete? • How to clean a dataset? • How to purge a very large data ecosystem? • Beyond HDFS • Conclusion
  • 36. Resources • Gobblin • https://gobblin.apache.org/ • https://github.com/apache/incubator-gobblin • Dali • https://engineering.linkedin.com/teams/data/projects/dali • WhereHows • https://github.com/linkedin/WhereHows
  • 37. Acknowledgements • LinkedIn Teams: • Gobblin • Dali • Metadata • Trust Engineering • Legal • House Security • Applications
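
The following is a minimal sketch of the ideas on slides 11, 17, and 18: normalize the different member ID spellings, then drop every matching record in a single pass over the data. It is an illustration in plain Python, not the Hive/Gobblin compliance operator described in the talk. The names normalize_member_id and purge_records, the memberId field, and the regular expression are hypothetical, and composite URNs are deliberately not handled.

```python
import re
from typing import Dict, Iterable, Iterator, Optional, Set

# Hypothetical pattern for the spellings listed on slide 17:
# 123, urn:member:123, member_123. Composite URNs are out of scope here.
_MEMBER_ID = re.compile(r"^(?:urn:member:|member_)?(\d+)$")


def normalize_member_id(raw: object) -> Optional[int]:
    """Return the numeric member ID, or None if the value is not recognized."""
    match = _MEMBER_ID.match(str(raw))
    return int(match.group(1)) if match else None


def purge_records(
    records: Iterable[Dict[str, object]],
    purge_ids: Set[int],
    id_field: str = "memberId",  # assumed field name, for illustration only
) -> Iterator[Dict[str, object]]:
    """Yield only records whose member ID is not in the batched purge set.

    All purge requests are applied together in one scan, so the data is
    rewritten once no matter how many IDs are pending.
    """
    for record in records:
        if normalize_member_id(record.get(id_field)) not in purge_ids:
            yield record


if __name__ == "__main__":
    rows = [
        {"memberId": 123},
        {"memberId": "urn:member:456"},
        {"memberId": "member_789"},
    ]
    print(list(purge_records(rows, {123, 789})))
```

Records with an unrecognized ID are kept rather than dropped in this sketch, which reflects the "risk of deleting wrong data" concern on slide 17; a real system would likely also flag them for the dataset owner.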

Editor's Notes

  1. Privacy has always been central to LinkedIn in ensuring members' trust, and GDPR provided an opportunity to double down on our privacy efforts.
  2. Given the scale that we are operating at, it's important that the compliance framework meet some basic principles. Emphasize that the system we build needs to uphold these core principles.
  3. To achieve our core tenets, our vision for our compliance framework is to support the following:
  4. Transition: In addition to these challenges, there's also the challenge of the huge scale of data at LinkedIn.
  5. Heterogeneous schemas and data formats. Some datasets have member ids, some have company ids or contract ids or article ids. How do we know what to delete? You need metadata on all the datasets.
  6. Discuss data owners first. Automation helps alleviate redundant work, can make suggestions, and can catch errors.
  7. Transition: Even if you catalog all your datasets and collect metadata about each dataset, there are still challenges.
  8. Need to provide closure on this slide: maybe say WhereHows provides some solutions to these problems but they are outside the scope of this presentation. We discuss high-level solutions in this presentation; will go into detail in a subsequent presentation.
  9. Keep purged records in case of incorrect annotations or a bug in the purging code.
  10. Most data stores should be ingested into HDFS anyway for offline data warehouse analyses. An application using a data store does not need to implement its own custom purging. Purging can just be done with the existing offline system, which uses the compliance library. Generate a list of keys to delete, and then we push the changes to the data store.
  11. Given the scale that we are operating at, it's important that the compliance framework meet some basic principles.
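
The three-step process from slide 33 and note 10 (dump a snapshot, select the primary keys to delete, push the changes back) can be sketched as follows. This is a toy model under stated assumptions: the data store is an in-memory dict keyed by primary key, and dump_snapshot, select_keys, and apply_deletes are hypothetical helpers standing in for the Gobblin and compliance-library steps, whose real interfaces are not shown in the talk.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Store = Dict[str, Row]  # primary key -> row; a stand-in for a real data store


def dump_snapshot(store: Store) -> Store:
    """Step 1: copy the store's contents (in the talk, a snapshot into HDFS)."""
    return {key: dict(row) for key, row in store.items()}


def select_keys(snapshot: Store, should_purge: Callable[[Row], bool]) -> List[str]:
    """Step 2: scan the snapshot offline and collect primary keys to delete."""
    return sorted(key for key, row in snapshot.items() if should_purge(row))


def apply_deletes(store: Store, keys: List[str]) -> int:
    """Step 3: push the deletions back to the live store; return rows removed."""
    removed = 0
    for key in keys:
        if store.pop(key, None) is not None:
            removed += 1
    return removed


if __name__ == "__main__":
    store: Store = {
        "k1": {"memberId": 1},
        "k2": {"memberId": 2},
        "k3": {"memberId": 1},
    }
    keys = select_keys(dump_snapshot(store), lambda row: row["memberId"] == 1)
    print(keys, apply_deletes(store, keys), store)
```

Because the selection runs against the snapshot rather than the live store, the application that owns the store needs no purging logic of its own, which is the point made in note 10.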