Handling Personal Information in
LinkedIn’s Content Ingestion System
David Max
Senior Software Engineer
LinkedIn
About Me
• Software Engineer at LinkedIn NYC
since 2015
• Content Ingestion team
• Office Hours –
Thursday 11:30-12:00
www.linkedin.com/in/davidpmax/
About LinkedIn New York Engineering
• Located in Empire State Building
• Approximately 100 engineers and
1000 employees total
• Multiple teams, front end, back
end, and data science
Disclaimers
• I’m not a lawyer
• Some details omitted
• I am not a spokesperson for official LinkedIn
policy
Our Mission
Create economic opportunity for every member
of the global workforce
LinkedIn
• World’s largest professional network
• >546M members
• >70% of members reside outside the U.S.
• More than 200 countries and territories worldwide
General Data Protection Regulation
• Applies to all companies worldwide that
process personal data of individuals in the EU.
• Widens definition of personal data.
• Introduces restrictive data handling
principles.
• Enforceable from May 25, 2018.
Handling Personally Identifiable Information (PII)
• Data Minimization: Limit personal data collection, storage, usage
• Consent: Cannot use collected data for a different purpose
• Retention: Do not hold data longer than necessary
• Deletion: Must delete data upon request
Handling PII in Content Ingestion
[Diagram: Content Ingestion (Babylonia) alongside Data Protection (Data Minimization, Consent, Retention, Deletion)]
What is Content Ingestion?
Content Ingestion
Babylonia
url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image:
https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sq
poaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXABu00
26rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
What is Content Ingestion?
• Extracts metadata from web pages
• Source of Truth for 3rd party content
• Also contains metadata for some
public 1st party content
• Used by LinkedIn services for sharing,
decorating, and embedding content
• Data also feeds into content
understanding and relevance models
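As a rough illustration of "extracts metadata from web pages", the sketch below pulls a url/title/image record out of an HTML page using only the Python standard library. It is not Babylonia's implementation: the choice of Open Graph `<meta>` tags as the metadata source, the names `MetadataExtractor` and `extract_metadata`, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collects Open Graph <meta> tags, falling back to <title> for the title."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attr.get("property") or ""
            # og:title -> "title", og:image -> "image", and so on.
            if prop.startswith("og:") and attr.get("content"):
                self.metadata[prop[3:]] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only use <title> text if no title has been recorded yet.
        if self._in_title and "title" not in self.metadata:
            self.metadata["title"] = data.strip()


def extract_metadata(url, html):
    """Return a flat metadata record for one fetched page."""
    parser = MetadataExtractor()
    parser.feed(html)
    return {"url": url, **parser.metadata}


# Hypothetical page content, used only to exercise the sketch.
SAMPLE_HTML = """<html><head>
<title>Example page</title>
<meta property="og:title" content="SATURN 2017 Keynote: Software is Details">
<meta property="og:image" content="https://example.com/thumb.jpg">
</head><body></body></html>"""

record = extract_metadata("https://example.com/watch", SAMPLE_HTML)
```

In this sketch an `og:title` tag takes precedence over the `<title>` element regardless of which appears first in the page.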
How does PII get into Babylonia?
Ingesting 1st party pages containing publicly viewable member PII
• Profile pages
• Publish posts
• SlideShare content
When a Member Account is Closed
What happens
• Babylonia (along with other systems) is notified that the member’s account is closed
• Other systems take down the member’s content (e.g. public profile page, publish posts)
What Babylonia needs to do
• Remove scraped data relating to the member pages that have been taken down
• Notify downstream systems that might be holding a copy of the data
Babylonia Datasets
[Diagram: Content Ingestion (Babylonia) datasets: Espresso Database, Brooklin Data Change Events, HDFS ETL Datasets]
Downstream and Upstream Datasets
[Diagram: Espresso Database, HDFS ETL, Brooklin Data Change Events; 1st party web page sources (profile, job, article, publishing); tiers labeled Online Service, Near Line, Offline]
Challenges of member PII in Babylonia
• Need to identify URLs that contain a member’s PII
• My post might contain your PII
• Connection between member and the URL resides in the upstream system
Option #1: Require Upstream Systems to Notify Babylonia
Pros
• Simple – Babylonia waits to be told specifically which URLs should be purged
• Babylonia only does extra work when a URL needs to be purged
• Puts responsibility where the knowledge is
Cons
• Requires additional work by every system that exposes PII in publicly accessible web pages
• If the notification is missed, how will Babylonia know?
• 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too
Option #2: Actively Refetch Every 1st Party URL
Pros
• Simple logic: Page gone? Purge the page.
• Requires little additional work from upstream systems
• Works also for old 1st party URLs
Cons
• There are a lot of 1st party URLs in Babylonia
• Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
• Extra work to avoid false positives or false negatives
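The "Page gone? Purge the page." rule, together with the concern about false positives, can be sketched as follows. This is a minimal illustration under stated assumptions, not the production logic: treating only 404 and 410 as "gone", requiring several consecutive "gone" responses before purging, and the names `should_purge` and `refetch_and_purge` are all choices made for the example. The fetcher is passed in as a function so the sketch runs without network access.

```python
# Assumption: only these statuses count as "the page is gone".
GONE_STATUSES = {404, 410}


def should_purge(fetch_status, url, attempts=3):
    """Return True only if every one of `attempts` refetches reports the page gone.

    A single 5xx or 200 response vetoes the purge, which guards against
    purging a live page because of one transient error (a false positive).
    """
    for _ in range(attempts):
        if fetch_status(url) not in GONE_STATUSES:
            return False
    return True


def refetch_and_purge(store, fetch_status, attempts=3):
    """Refetch every stored URL and delete the entries whose pages are gone."""
    purged = []
    for url in list(store):  # copy the keys: the store is mutated in the loop
        if should_purge(fetch_status, url, attempts):
            del store[url]
            purged.append(url)
    return purged


# Hypothetical data: a dict stands in for the datastore, and a lookup table
# stands in for real HTTP fetches.
statuses = {
    "https://www.example.com/deleted-post": 404,
    "https://www.example.com/live-post": 200,
    "https://www.example.com/flaky-post": 503,
}
store = {url: {"title": "placeholder"} for url in statuses}
purged = refetch_and_purge(store, statuses.get)
```

Note that a page that keeps answering 200 after its data has been removed would never be purged by this rule, which is the false-negative case.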
Option #3: Eliminate Member PII in Babylonia
Pros
• The easiest data to delete is data that isn’t in your system to begin with
• Gets closer to Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance
Cons
• Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience
• No substitute currently available
• Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)
Blended Approach
• Option 1 - Having upstream systems notify is
best, but might miss some pages
• Option 2 - Active refetch is thorough but
expensive. Must use to catch pages that
won’t support notifications
• Option 3 - Some pages won’t work with active
refetch. For example, pages that still return
an HTTP status code 200 even when the data
has been removed. These must be blocked
Classification of Ingested URLs
[Diagram: classification tree with nodes URL, 3rd Party, 1st Party, Blocked, Whitelisted, Actively Refetched, Notified by Upstream]
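One way to read the classification is as a default-deny lookup: 3rd party URLs pass through, and a 1st party URL is blocked unless it matches a whitelist entry that says whether it is covered by upstream notification or by active refetch. The sketch below assumes that reading. The host names, path prefixes, and the `classify` function are hypothetical, and it assumes the URL has already been resolved (shortlinks cannot be classified by inspection alone).

```python
from enum import Enum
from urllib.parse import urlparse


class Handling(Enum):
    THIRD_PARTY = "3rd party"
    BLOCKED = "blocked"
    ACTIVELY_REFETCHED = "actively refetched"
    NOTIFIED_BY_UPSTREAM = "notified by upstream"


# Hypothetical configuration, for illustration only.
FIRST_PARTY_HOSTS = {"www.linkedin.com", "www.slideshare.net"}
WHITELIST = {
    ("www.linkedin.com", "/in/"): Handling.NOTIFIED_BY_UPSTREAM,
    ("www.linkedin.com", "/pulse/"): Handling.ACTIVELY_REFETCHED,
}


def classify(url):
    """Classify a resolved URL; 1st party URLs are blocked unless whitelisted."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host not in FIRST_PARTY_HOSTS:
        return Handling.THIRD_PARTY
    for (wl_host, prefix), handling in WHITELIST.items():
        if host == wl_host and parts.path.startswith(prefix):
            return handling
    return Handling.BLOCKED
```

For example, `classify("https://www.linkedin.com/unknown/page")` returns `Handling.BLOCKED` because no whitelist entry covers that path.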
Option 1 – Upstream Notification
• Upstream system sends a
Kafka message
• Babylonia consumes message
and purges data
• Open source -
kafka.apache.org
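The consuming side of this flow might look like the handler below. It is only a sketch: the Kafka consumer loop itself is omitted, the JSON message shape (`{"urls": [...]}`) is an assumed schema rather than LinkedIn's, and a plain dict stands in for the datastore. The handler is written to be idempotent, so receiving the same notification twice does no harm.

```python
import json


def handle_purge_message(raw_value, store):
    """Purge every URL named in one notification message.

    `raw_value` is the message payload as bytes (assumed to be UTF-8 JSON with
    a "urls" list). Returns the URLs that were actually removed, so a
    redelivered message simply returns an empty list.
    """
    message = json.loads(raw_value.decode("utf-8"))
    removed = []
    for url in message.get("urls", []):
        if url in store:
            del store[url]
            removed.append(url)
    return removed


# Hypothetical store contents and message, used only to exercise the handler.
store = {
    "https://www.linkedin.com/in/example-member/": {"title": "Example Member"},
    "https://www.example.com/article": {"title": "Some article"},
}
payload = json.dumps({"urls": ["https://www.linkedin.com/in/example-member/"]}).encode("utf-8")
first = handle_purge_message(payload, store)
second = handle_purge_message(payload, store)  # simulated redelivery
```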
Option 2 – Active Refetching
[Diagram: Espresso Database, HDFS ETL, refetch URL table, offline job, push job, Kafka refetch messages, refetch process; outputs labeled UPDATE and Takedown Requests for deleted pages]
Option 3 – Whitelist
• Block all 1st party URLs that
can’t meet minimal
requirements
• Mainly must return a 404 for an
invalid or deleted URL
• Ensures new 1st party URLs are
onboarded before being
ingested
Managing PII in Datasets
[Diagram: Espresso Database, HDFS ETL, Offline Datasets]
Espresso Datasets
What is Espresso?
• LinkedIn distributed NoSQL database
• Data stored in Avro format (JSON)
• Indexed by specific primary key fields
Challenges
• Reference to PII not always in the key
• ETL snapshots of Espresso Dataset become offline Datasets
Offline (HDFS) Datasets
Challenges
• Files of Avro (JSON) records
• Need to read whole record to see if it has PII
• Files not conducive to removing one record from the middle
• Dataset can be source for downstream jobs that also need to be purged
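Because a record cannot simply be cut out of the middle of a file, purging an offline dataset generally means rewriting it. The sketch below shows that pattern under simplifying assumptions: JSON-lines files stand in for Avro files on HDFS, the `url` field name is hypothetical, and `purge_records` is an illustrative helper rather than any real tool's API.

```python
import json
import os
import tempfile


def purge_records(path, urls_to_purge, url_field="url"):
    """Rewrite a JSON-lines file without the records whose URL must be purged.

    Every record is read in full, survivors are written to a temporary file in
    the same directory, and the temporary file then replaces the original.
    Returns the number of records removed.
    """
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out, open(path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if record.get(url_field) in urls_to_purge:
                removed += 1
                continue
            out.write(json.dumps(record) + "\n")
    os.replace(tmp_path, path)  # swap the rewritten file into place
    return removed
```

The cost is proportional to the size of the whole file even when a single record is removed, which is one reason batching purge requests is attractive.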
Data Discovery
Which datasets contain member PII?
WhereHows
• Data discovery and lineage tool
• Central location for all schemas
• Document meanings of each column
• Trace downstream/upstream lineage of datasets
• Tag every column that can contain member reference or PII
• Open Source - github.com/linkedin/wherehows
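To make the idea of column tags and lineage concrete, here is a toy in-memory model. It does not use or imitate the WhereHows API: the dataset names, tag names (`MEMBER_PII`, `INGESTED_URL`), and both helper functions are invented for the example.

```python
# Hypothetical catalog: dataset -> column -> set of tags.
COLUMN_TAGS = {
    "ingested_content": {
        "url": {"INGESTED_URL"},
        "title": set(),
        "author_name": {"MEMBER_PII"},
    },
    "share_counts": {"url": {"INGESTED_URL"}, "count": set()},
    "weekly_report": {"url": {"INGESTED_URL"}, "total": set()},
    "crawl_stats": {"host": set(), "fetches": set()},
}

# Hypothetical lineage: dataset -> datasets derived directly from it.
LINEAGE = {
    "ingested_content": ["share_counts"],
    "share_counts": ["weekly_report"],
    "weekly_report": [],
    "crawl_stats": [],
}


def datasets_with_tag(tag):
    """Answer "which datasets contain X?" by scanning the column tags."""
    return sorted(
        name
        for name, columns in COLUMN_TAGS.items()
        if any(tag in tags for tags in columns.values())
    )


def downstream_of(dataset):
    """Return every dataset derived, directly or indirectly, from `dataset`."""
    seen = set()
    stack = list(LINEAGE.get(dataset, []))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(LINEAGE.get(name, []))
    return sorted(seen)
```

With this data, `datasets_with_tag("MEMBER_PII")` identifies the dataset to purge, and `downstream_of("ingested_content")` lists the derived datasets that may hold copies.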
Dali (Data Access at LinkedIn)
• Interface for accessing datasets
• Combines dataset schema with WhereHows metadata
• Defines output virtual dataset while preserving data tags
• Supports defining virtual datasets where PII is excluded or obfuscated
[Diagram: Raw Dataset, WhereHows Metadata, Dali Reader]
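The notion of a virtual dataset that excludes or obfuscates PII while carrying tags along can be sketched as a function over rows plus column tags. This is an illustration only and not Dali's interface: `make_view`, the `MEMBER_PII` tag, and the truncated SHA-256 masking are assumptions, and hashing is used here purely as a stand-in for whatever obfuscation a real system would apply.

```python
import hashlib


def _mask(value):
    """Replace a value with a short, stable digest (illustrative obfuscation)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def make_view(schema_tags, rows, pii_tag="MEMBER_PII", mode="exclude"):
    """Derive a view of `rows` in which PII-tagged columns are dropped or masked.

    `schema_tags` maps column name -> set of tags. Returns (view_tags,
    view_rows); the tags of every surviving column are copied to the view,
    so the tagging follows the data into the derived dataset.
    """
    if mode not in ("exclude", "obfuscate"):
        raise ValueError("mode must be 'exclude' or 'obfuscate'")
    pii_columns = {c for c, tags in schema_tags.items() if pii_tag in tags}
    if mode == "exclude":
        view_tags = {c: set(t) for c, t in schema_tags.items() if c not in pii_columns}
        view_rows = [
            {c: v for c, v in row.items() if c not in pii_columns} for row in rows
        ]
    else:
        view_tags = {c: set(t) for c, t in schema_tags.items()}
        view_rows = [
            {c: (_mask(v) if c in pii_columns else v) for c, v in row.items()}
            for row in rows
        ]
    return view_tags, view_rows


# Hypothetical schema and data.
schema = {"url": {"INGESTED_URL"}, "title": set(), "author_name": {"MEMBER_PII"}}
rows = [{"url": "https://example.com/p", "title": "Post", "author_name": "Jane Doe"}]
excluded_tags, excluded_rows = make_view(schema, rows, mode="exclude")
masked_tags, masked_rows = make_view(schema, rows, mode="obfuscate")
```

In the obfuscated view the column keeps its PII tag, a deliberately conservative choice in this sketch since a masked value may still be linkable to a member.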
Access Control
Only systems that handle PII properly are allowed access
Access Control List (ACL)
• Restricts access to PII data to a known list of authorized systems
• We only approve access to systems that can handle PII properly
• Ensures that member PII can’t leak into untracked systems/datasets
• Acts as a list of downstream services
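The double role of the ACL, gatekeeper and list of downstream services, can be shown with a small class. This is a sketch with invented names (`PiiAccessControl`, `grant`, `read`, `systems_to_notify`) and says nothing about how LinkedIn's ACLs are actually implemented.

```python
class PiiAccessControl:
    """Default-deny access list that also records who holds the data."""

    def __init__(self):
        self._authorized = {}  # dataset name -> set of authorized system names

    def grant(self, dataset, system):
        """Record that `system` has been approved to read `dataset`."""
        self._authorized.setdefault(dataset, set()).add(system)

    def read(self, dataset, system, loader):
        """Run `loader` only if `system` is on the list for `dataset`."""
        if system not in self._authorized.get(dataset, set()):
            raise PermissionError(f"{system} is not authorized to read {dataset}")
        return loader()

    def systems_to_notify(self, dataset):
        """Every authorized reader is a potential holder of copies to purge."""
        return sorted(self._authorized.get(dataset, set()))


# Hypothetical systems and dataset names.
acl = PiiAccessControl()
acl.grant("ingested_content", "feed-service")
acl.grant("ingested_content", "search-indexer")
data = acl.read("ingested_content", "feed-service", lambda: ["row-1", "row-2"])
```

Because reads are only possible through the list, the same list answers "who must be told when data is purged?" without separate bookkeeping.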
Keeping Track of Personal Information in Babylonia
WhereHows
• Field tagging for fields containing PII
• Know where the PII is
Dali
• Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
• Keeps tags with the data as it moves from one dataset to another
ACL
• Control the spread of PII data only to authorized readers
• Serves as a list of current downstream systems to notify when data is purged
Apache Gobblin
• Framework for transforming large
datasets
• Data lifecycle management
• Uses WhereHows tags to identify data
in our Espresso or offline datasets that
need to be purged
• Open source - gobblin.apache.org
WhereHows and Gobblin
Tagging in WhereHows
• Created tags representing ingested content URLs in WhereHows
• Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
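The benefit of tagging URL columns is that a single generic purge routine can clean any onboarded table without knowing its schema in advance. The sketch below illustrates that idea with in-memory tables. It is not Gobblin or Espresso code, and `purge_by_tag`, the `INGESTED_URL` tag, and the table names are hypothetical.

```python
def purge_by_tag(tables, column_tags, purged_urls, tag="INGESTED_URL"):
    """Remove rows that reference a purged URL in any column carrying `tag`.

    `tables` maps table name -> list of row dicts and is updated in place.
    `column_tags` maps table name -> column name -> set of tags.
    Tables with no tagged column are left untouched.
    Returns a dict of table name -> number of rows removed.
    """
    removed = {}
    for name, rows in list(tables.items()):
        url_columns = [
            column
            for column, tags in column_tags.get(name, {}).items()
            if tag in tags
        ]
        if not url_columns:
            continue
        kept = [
            row
            for row in rows
            if not any(row.get(column) in purged_urls for column in url_columns)
        ]
        removed[name] = len(rows) - len(kept)
        tables[name] = kept
    return removed


# Hypothetical downstream tables; note the differing column names.
tables = {
    "shares": [
        {"shared_url": "https://example.com/a", "count": 3},
        {"shared_url": "https://example.com/b", "count": 5},
    ],
    "embeds": [{"target": "https://example.com/a", "width": 640}],
    "hosts": [{"host": "example.com"}],
}
column_tags = {
    "shares": {"shared_url": {"INGESTED_URL"}, "count": set()},
    "embeds": {"target": {"INGESTED_URL"}, "width": set()},
    "hosts": {"host": set()},
}
removed = purge_by_tag(tables, column_tags, {"https://example.com/a"})
```

A downstream team onboards by adding one tag to its schema metadata; the purge code itself does not change.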
Compliance Comes First
• Choose an implementation where restriction is the default until proven safe
• Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion
• Simplicity of active refetching helps keep the bar low enough to include most content safely
Bigger Picture
Constraints
• Added constraints to the system
• Developer restrictions
• Made certain kinds of things harder to do
Bigger Picture
Constraints / Guide Rails
“Constraints can act as guide rails that point a system where you want it to go.”
George Fairbanks
• A constrained system is easier to predict and control
• Make the wrong things harder to do
• Give guidance to all developers on how things are supposed to be done
Bigger Picture
Manifest Guide Rails in the Code
• Constraints should manifest in some explicit way
• Counter-Example: “No backwards incompatible schema changes”
• Hard to tell what developers refrained from doing
• WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate
Bigger Picture
Architecture Hoisting
A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.
Bigger Picture
Architecture Hoisting
• Make use of the framework to manage PII
• Requires developers to think about PII concerns up front to access the data
• Once set up, developers can focus less on managing PII because the architecture is handling it
• Users of the framework can automatically benefit from future enhancements
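A small example of hoisting a guide rail into code: instead of relying on developers to remember to tag PII columns, a dataset cannot be registered at all until every column has an explicit tag entry, even if that entry is empty. This is an illustrative sketch, not a description of LinkedIn's framework, and `register_dataset`, `UntaggedColumnError`, and `REGISTRY` are hypothetical names.

```python
class UntaggedColumnError(Exception):
    """Raised when a dataset is registered with columns that have no tag entry."""


REGISTRY = {}  # dataset name -> column name -> frozenset of tags


def register_dataset(name, columns, tags):
    """Register a dataset only if every column has an explicit tag decision.

    An empty set means "reviewed, contains no PII". A missing entry means
    nobody has decided yet, and registration is refused.
    """
    missing = [column for column in columns if column not in tags]
    if missing:
        raise UntaggedColumnError(
            f"{name}: no tag decision for columns {missing}"
        )
    REGISTRY[name] = {column: frozenset(tags[column]) for column in columns}
    return REGISTRY[name]


# Hypothetical usage: the first call succeeds, the second is rejected.
register_dataset(
    "ingested_content",
    ["url", "title", "author_name"],
    {"url": {"INGESTED_URL"}, "title": set(), "author_name": {"MEMBER_PII"}},
)
try:
    register_dataset("new_table", ["url", "comment"], {"url": {"INGESTED_URL"}})
    rejected = False
except UntaggedColumnError:
    rejected = True
```

The developer has to consider PII once, at registration time; after that the tags are available to any purge or access-control mechanism built on the registry.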
Thank you

David Max SATURN 2018 - Handling Personal Information in LinkedIn's Content Ingestion System

  • 1. Handling Personal Information in LinkedIn’s Content Ingestion System David Max Senior Software Engineer LinkedIn
  • 2. About Me • Software Engineer at LinkedIn NYC since 2015 • Content Ingestion team • Office Hours – Thursday 11:30-12:00 David Max Senior Software Engineer LinkedIn www.linkedin.com/in/davidpmax/
  • 3. About LinkedIn New York Engineering • Located in Empire State Building • Approximately 100 engineers and 1000 employees total • Multiple teams, front end, back end, and data science New York Engineering
  • 4. Disclaimers • I’m not a lawyer • Some details omitted • I am not a spokesperson for official LinkedIn policy
  • 5. O U R M I S S I O N Create economic opportunity for every member of the global workforce
  • 6. LinkedIn >546M >70% • World’s largest professional network members of members reside outside the U.S. • More than 200 countries and territories worldwide
  • 7. General Data Protection Regulation • Applies to all companies worldwide that process personal data of EU citizens. • Widens definition of personal data. • Introduces restrictive data handling principles. • Enforceable from May 25, 2018.
  • 8. Handling Personally Identifiable Information (PII) Limit personal data collection, storage, usage Data Minimization Cannot use collected data for a different purpose Consent Do not hold data longer then necessary Retention Must delete data upon request Deletion
  • 9. Handling PII in Content Ingestion Content Ingestion Data Protection Babylonia Data Minimization Consent Retention Deletion
  • 10. What is Content Ingestion? Content Ingestion Babylonia
  • 13. Content Ingestion Babylonia url: https://www.youtube.com/watch?v=MS3c9hz0bRg title: "SATURN 2017 Keynote: Software is Details” image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sq poaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXABu00 26rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
  • 15. What is Content Ingestion? Content Ingestion Babylonia • Extracts metadata from web pages • Source of Truth for 3rd party content • Also contains metadata for some public 1st party content • Used by LinkedIn services for sharing, decorating, and embedding content • Data also feeds into content understanding and relevance models
  • 16. How does PII get into Babylonia?
  • 17. Ingesting 1st party pages containing publicly viewable member PII • Profile pages • Publish posts • SlideShare content
  • 18. When a Member Account is Closed • Remove scraped data relating to the member pages that have been taken down • Notify downstream systems that might be holding a copy of the data • Babylonia (along with other systems) is notified that the member’s account is closed • Other systems take down the member’s content (i.e. public profile page, publish posts, etc.) What happens What Babylonia needs to do
  • 19. Babylonia Datasets Espresso Database HDFS ETL Brooklin Data Change Events Datasets Content Ingestion Babylonia
  • 20. Downstream and Upstream Datasets Espresso Database HDFS ETL Brooklin Data Change Events 1st party web page profile job article publishing profile Online Service Near Line Offline
  • 21. Challenges of member PII in Babylonia • Need to identify URLs that contain a member’s PII • My post might contain your PII • Connection between member and the URL resides in the upstream system
  • 22. Option #1: Require Upstream Systems to Notify Babylonia Pros: • Simple – Babylonia waits to be told specifically which URLs should be purged • Babylonia only does extra work when a URL needs to be purged • Puts responsibility where the knowledge is Cons: • Requires additional work by every system that exposes PII in publicly accessible web pages • If the notification is missed, how will Babylonia know? • 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too
  • 23. Option #2: Actively Refetch Every 1st Party URL Pros: • Simple logic: Page gone? Purge the page. • Requires little additional work from upstream systems • Works also for old 1st party URLs Cons: • There are a lot of 1st party URLs in Babylonia • Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected • Extra work to avoid false positives or false negatives
  • 24. Option #3: Eliminate Member PII in Babylonia Pros: • The easiest data to delete is data that isn’t in your system to begin with • Gets closer to Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance Cons: • Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience • No substitute currently available • Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)
  • 25. Blended Approach • Option 1 - Having upstream systems notify is best, but might miss some pages • Option 2 - Active refetch is thorough but expensive. Must use to catch pages that won’t support notifications • Option 3 - Some pages won’t work with active refetch. For example, pages that still return an HTTP status code 200 even when the data has been removed. These must be blocked
  • 26. Classification of Ingested URLs • URL: 3rd Party or 1st Party • 1st Party: Blocked or Whitelisted • Whitelisted: Actively Refetched or Notified by Upstream
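The classification on this slide can be sketched as a small decision function. This is only an illustration: it assumes the tree reads as 3rd party vs. 1st party, then blocked vs. whitelisted, then the two whitelisted handling modes. The host names, path prefixes, and the `Handling` and `classify` names are hypothetical, and matching on the URL alone is a simplification (shortlinks, for example, cannot be recognized this way).

```python
from enum import Enum
from urllib.parse import urlparse


class Handling(Enum):
    THIRD_PARTY = "3rd party"
    BLOCKED = "blocked"
    ACTIVELY_REFETCHED = "actively refetched"
    NOTIFIED_BY_UPSTREAM = "notified by upstream"


# Hypothetical configuration; the real hosts and prefixes are not in the slides.
FIRST_PARTY_HOSTS = {"www.linkedin.com", "www.slideshare.net"}
WHITELIST = {
    ("www.linkedin.com", "/pulse/"): Handling.NOTIFIED_BY_UPSTREAM,
    ("www.linkedin.com", "/in/"): Handling.ACTIVELY_REFETCHED,
}


def classify(url: str) -> Handling:
    """Decide how an ingested URL is handled for PII purposes."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in FIRST_PARTY_HOSTS:
        return Handling.THIRD_PARTY
    for (wl_host, prefix), handling in WHITELIST.items():
        if host == wl_host and parsed.path.startswith(prefix):
            return handling
    # 1st party content that has not been onboarded is blocked by default.
    return Handling.BLOCKED
```

The default branch is the important one: a 1st party URL that nobody has onboarded falls through to `BLOCKED` rather than being ingested.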
  • 27. Option 1 – Upstream Notification • Upstream system sends a Kafka message • Babylonia consumes message and purges data • Open source - kafka.apache.org
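A rough sketch of the consume-and-purge step follows. The slides do not show the message schema, so the JSON shape (`{"urls": [...]}`) is an assumption, and `ContentStore` is a hypothetical in-memory stand-in for the real store. No Kafka client is used here; `handle_takedown` is the function a consumer loop would call with each message payload.

```python
import json


class ContentStore:
    """In-memory stand-in for the ingested-content store (hypothetical)."""

    def __init__(self):
        self._by_url = {}

    def put(self, url, metadata):
        self._by_url[url] = metadata

    def purge(self, url):
        """Remove a URL's scraped data; return True if something was removed."""
        if url in self._by_url:
            del self._by_url[url]
            return True
        return False

    def __contains__(self, url):
        return url in self._by_url


def handle_takedown(raw_message: bytes, store: ContentStore) -> list:
    """Purge every URL named in one takedown message.

    Assumed payload shape: {"urls": ["https://...", ...]}.
    Returns the URLs that were actually removed.
    """
    event = json.loads(raw_message)
    purged = []
    for url in event.get("urls", []):
        if store.purge(url):
            purged.append(url)
    return purged
```

Purging is idempotent in this sketch, so a redelivered message is harmless: the second delivery simply removes nothing.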
  • 28. Option 2 – Active Refetching Espresso Database HDFS ETL Refetch URL table Refetch URL table Offline job Refetch messages Kafka Push job Refetch process UPDATE Takedown Requests for deleted pages
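The core of the refetch process ("Page gone? Purge the page.") might look like the sketch below. The HTTP fetch is injected as `fetch_status` so the logic stays independent of any client library; that parameter and the `find_takedowns` name are hypothetical. Treating 410 the same as 404 is an assumption, since the slides only mention 404.

```python
from typing import Callable, Iterable, List

# 404 comes from the slides; 410 (Gone) is an added assumption.
GONE_STATUSES = {404, 410}


def find_takedowns(urls: Iterable[str],
                   fetch_status: Callable[[str], int]) -> List[str]:
    """Return the URLs whose pages are gone and should be purged."""
    takedowns = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError:
            # A network failure is not evidence of deletion (avoids false
            # positives); the URL is simply checked again on the next cycle.
            continue
        if status in GONE_STATUSES:
            takedowns.append(url)
        # Anything else, including 5xx, leaves the data in place.
    return takedowns
```

Only an explicit "gone" response triggers a takedown request here, which is one way to address the false-positive concern raised for this option.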
  • 29. Option 3 – Whitelist • Block all 1st party URLs that can’t meet minimal requirements • Mainly must return a 404 for an invalid or deleted URL • Ensures new 1st party URLs are onboarded before being ingested
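One way to picture the onboarding requirement is a whitelist that only admits a URL prefix after a probe shows that a deleted page under it returns 404. This is a sketch under assumptions: the `Whitelist` class, its methods, and the idea of supplying a known-deleted URL as the probe are illustrative, not taken from the slides.

```python
from typing import Callable


class Whitelist:
    """1st party URL prefixes that have met the minimum bar for ingestion."""

    def __init__(self, fetch_status: Callable[[str], int]):
        self._fetch_status = fetch_status
        self._prefixes = set()

    def onboard(self, url_prefix: str, known_deleted_url: str) -> bool:
        """Admit a prefix only if a deleted page under it returns 404."""
        if not known_deleted_url.startswith(url_prefix):
            raise ValueError("probe URL must fall under the prefix")
        if self._fetch_status(known_deleted_url) != 404:
            # For example, a page that still answers 200 after deletion.
            return False
        self._prefixes.add(url_prefix)
        return True

    def allows(self, url: str) -> bool:
        """Anything not explicitly onboarded stays blocked."""
        return any(url.startswith(prefix) for prefix in self._prefixes)
```

Because `allows` starts from an empty set, a new 1st party URL pattern is blocked until someone onboards it.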
  • 30. Managing PII in Datasets HDFS ETL Offline Datasets Espresso Database
  • 31. Espresso Datasets What is Espresso? • LinkedIn distributed NoSQL database • Data stored in Avro format (JSON) • Indexed by specific primary key fields Challenges: • Reference to PII not always in the key • ETL snapshots of Espresso Dataset become offline Datasets
  • 32. Offline (HDFS) Datasets Challenges: • Files of Avro (JSON) records • Need to read whole record to see if it has PII • Files not conducive to removing one record from the middle • Dataset can be source for downstream jobs that also need to be purged
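Because a record cannot be cut out of the middle of a file, purging usually means rewriting the file without the affected records. The sketch below shows that pattern on a local file of JSON lines, used here as a stand-in for Avro files on HDFS. The `purge_records` name and the `should_purge` predicate are hypothetical.

```python
import json
import os
import tempfile


def purge_records(path, should_purge) -> int:
    """Rewrite a JSON-lines file without the records that must be purged.

    Every record is read in full, because the PII reference may sit in any
    field. Returns the number of records removed.
    """
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out, \
                open(path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue
                record = json.loads(line)
                if should_purge(record):
                    removed += 1
                else:
                    out.write(json.dumps(record) + "\n")
        # Swap the rewritten file into place only once it is complete.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return removed
```

The cost is proportional to the whole file even when a single record is affected, which is the challenge the slide points at.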
  • 33. Which datasets contain member PII? Data Discovery: WhereHows • Data discovery and lineage tool • Central location for all schemas • Document meanings of each column • Trace downstream/upstream lineage of datasets • Tag every column that can contain a member reference or PII • Open source - github.com/linkedin/wherehows
  • 34. Dali (Data Access at LinkedIn) • Interface for accessing datasets • Combines dataset schema with WhereHows metadata • Defines output virtual dataset while preserving data tags • Supports defining virtual datasets where PII is excluded or obfuscated (Diagram: Raw Dataset and WhereHows Metadata feed the Dali Reader)
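The idea of a virtual dataset that excludes or obfuscates PII while keeping the tags can be sketched in a few lines. This is not Dali's API: `virtual_view`, the tag names, and the two modes are hypothetical. Treating untagged columns as PII is a deliberately restrictive assumption, and the unsalted SHA-256 digest is only a placeholder, not a real anonymization scheme.

```python
import hashlib


def virtual_view(rows, column_tags, mode="exclude"):
    """Build a view of `rows` with PII columns excluded or obfuscated.

    column_tags maps column name -> set of tags (e.g. {"PII"}).
    Returns (view_rows, view_tags) so the tags travel with the data.
    """
    if mode not in ("exclude", "obfuscate"):
        raise ValueError("mode must be 'exclude' or 'obfuscate'")

    def is_pii(column):
        # Columns with no tag entry are treated as PII until proven safe.
        return "PII" in column_tags.get(column, {"PII"})

    view_rows = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if not is_pii(column):
                out[column] = value
            elif mode == "obfuscate":
                digest = hashlib.sha256(str(value).encode("utf-8"))
                out[column] = digest.hexdigest()
        view_rows.append(out)

    if mode == "exclude":
        view_tags = {c: t for c, t in column_tags.items() if "PII" not in t}
    else:
        view_tags = dict(column_tags)
    return view_rows, view_tags
```

Returning the tags alongside the rows mirrors the point that tagging is preserved on the new virtual dataset, so a downstream reader still knows which columns are sensitive.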
  • 35. Access Control: Only systems that handle PII properly are allowed access Access Control List (ACL) • Controls access to PII data to a known list of authorized systems • We only approve access for systems that can handle PII properly • Ensures that member PII can’t leak into untracked systems/datasets • Acts as a list of downstream services
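The two roles of the ACL on this slide, gating reads and doubling as the list of downstream services, can be combined in one small sketch. The `PiiAccessControl` class and the idea of registering a purge callback at approval time are assumptions made for illustration.

```python
class AccessDenied(Exception):
    """Raised when a system that is not on the ACL asks for PII data."""


class PiiAccessControl:
    def __init__(self):
        # system name -> callback invoked when data must be purged
        self._on_purge = {}

    def authorize(self, system, on_purge):
        """Approve a system; it must supply a way to receive purge notices."""
        self._on_purge[system] = on_purge

    def read(self, system, dataset):
        """Hand out PII data only to systems on the list."""
        if system not in self._on_purge:
            raise AccessDenied(f"{system} is not authorized to read PII")
        return list(dataset)

    def notify_purge(self, url):
        """Tell every authorized downstream system to purge a URL."""
        notified = []
        for system in sorted(self._on_purge):
            self._on_purge[system](url)
            notified.append(system)
        return notified
```

Since a system cannot read the data without being registered, the set of systems to notify on a purge stays complete by construction.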
  • 36. Keeping Track of Personal Information in Babylonia WhereHows: • Field tagging for fields containing PII • Know where the PII is Dali: • Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets • Keeps tags with the data as it moves from one dataset to another ACL: • Control the spread of PII data only to authorized readers • Serves as a list of current downstream systems to notify when data is purged
  • 37. Apache Gobblin • Framework for transforming large datasets • Data lifecycle management • Uses WhereHows tags to identify data in our Espresso or offline datasets that need to be purged • Open source - gobblin.apache.org
  • 38. WhereHows and Gobblin: Tagging in WhereHows • Created tags representing ingested content URLs in WhereHows • Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
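A tag-driven purge can be sketched as follows: the purge job does not need to know each table's layout, only which columns carry the tag. This is not Gobblin's or WhereHows' API. The `purge_by_tag` function, the tag name `INGESTED_CONTENT_URN`, and the in-memory tables are hypothetical.

```python
def purge_by_tag(datasets, schema_tags, urn, tag="INGESTED_CONTENT_URN"):
    """Remove every record that references `urn` in a column carrying `tag`.

    datasets:    dataset name -> list of record dicts (modified in place)
    schema_tags: dataset name -> {column name -> set of tags}
    Returns dataset name -> number of records removed.
    """
    removed = {}
    for name, rows in list(datasets.items()):
        tagged_columns = [
            column
            for column, tags in schema_tags.get(name, {}).items()
            if tag in tags
        ]
        kept = [
            row for row in rows
            if not any(row.get(column) == urn for column in tagged_columns)
        ]
        removed[name] = len(rows) - len(kept)
        datasets[name] = kept
    return removed
```

A downstream team onboards by tagging its column; a dataset with no tagged column is left untouched, which is why complete tagging matters.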
  • 39. Compliance Comes First • Choose an implementation where restriction is the default until proven safe • Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion • Simplicity of active refetching helps keep the bar low enough to include most content safely
  • 40. Bigger Picture: Constraints • Added constraints to the system • Developer restrictions • Made certain kinds of things harder to do
  • 41. “Constraints can act as guide rails that point a system where you want it to go.” George Fairbanks
  • 42. Bigger Picture: Constraints / Guide Rails • A constrained system is easier to predict and control • Make the wrong things harder to do • Give guidance to all developers on how things are supposed to be done
  • 43. Bigger Picture: Manifest Guide Rails in the Code • Constraints should manifest in some explicit way • Counter-Example: “No backwards incompatible schema changes” • Hard to tell what developers refrained from doing • WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate
  • 44. Bigger Picture: Architecture Hoisting • A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.
  • 45. Bigger Picture: Architecture Hoisting • Make use of the framework to manage PII • Requires developers to think about PII concerns up front to access the data • Once set up, developers can focus less on managing PII because the architecture is handling it • Users of the framework can automatically benefit from future enhancements