LinkedIn Salary: A System for Secure
Collection and Presentation of Structured
Compensation Insights to Job Seekers
Krishnaram Kenthapadi
Staff Software Engineer and Applied Researcher, LinkedIn
(Joint work with Ahsan Chudhary, Stuart Ambler)
Outline
▪ LinkedIn Salary Overview
▪ Challenges: Privacy, Modeling
▪ System Design & Architecture
▪ Privacy vs. Modeling Tradeoffs
LinkedIn Salary (launched in Nov, 2016)
Salary Collection Flow via Email Targeting
Current Reach (July 2017)
▪ A few million responses out of several millions of
members targeted
– Targeted via emails since early 2016
▪ Countries: US, CA, UK, DE
▪ Insights available for a large fraction of US monthly
active users
Data Privacy Challenges
▪ Minimize the risk of inferring any one individual’s
compensation data
▪ Protection against data breach
– No single point of failure
Achieved by a combination of techniques:
encryption, access control, de-identification,
aggregation, thresholding
Modeling Challenges
▪ Modeling on aggregated data
▪ Evaluation
▪ Outlier detection
▪ Robustness and stability
K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal,
Bringing salary transparency to the world: Computing robust
compensation insights via LinkedIn Salary, CIKM 2017
(arxiv.org/abs/1703.09845)
Problem Statement
▪How do we design the LinkedIn Salary system, taking into
account the unique privacy and security challenges, while
addressing the product requirements?
Differential Privacy? [Dwork et al, 2006]
▪Rich privacy literature (Adam-Wortmann, Samarati-Sweeney, Agrawal-Srikant,
…, Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al)
– Limitation of anonymization techniques (Backstrom et al,…,Narayanan et al)
▪Worst case sensitivity of quantiles to any one user’s
compensation data is large
⇒ Large noise to be added, depriving the insights of reliability/usefulness
▪Need compensation insights on a continual basis
– Theoretical work on applying differential privacy under continual observations
▪ No practical implementations / applications
– Randomized response based approaches (Google’s RAPPOR; Apple) not applicable
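The sensitivity argument can be illustrated with a small sketch. It assumes the standard Laplace mechanism, whose noise scale is the sensitivity divided by epsilon. The salary values, the bounds `lo`/`hi` and the value of `epsilon` are made-up illustration values, not figures from the talk.

```python
import statistics


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    return sensitivity / epsilon


# Hypothetical cohort (illustrative values only).
salaries = [60_000, 60_000, 200_000, 200_000, 200_000]
# Neighbouring dataset: a single user's entry changes from 200K to 60K.
neighbour = [60_000, 60_000, 60_000, 200_000, 200_000]

# One changed entry moves the median across the whole gap in the data.
median_shift = statistics.median(salaries) - statistics.median(neighbour)

# Assumed bounds on a valid submission. In the worst case the median
# can move by the full range hi - lo when one entry changes.
lo, hi = 0, 1_000_000
epsilon = 1.0  # assumed privacy budget
noise_scale = laplace_scale(hi - lo, epsilon)

print(f"median shift from one entry: {median_shift}")
print(f"Laplace noise scale at epsilon={epsilon}: {noise_scale}")
```

With a noise scale on the order of the full salary range, a noisy median would carry little information, which is the concern raised in the bullet above.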
De-identification Example
Original submission:
Title | Region | Company | Industry | Years of exp | Degree | FoS | Skills | $$
User Exp Designer | SF Bay Area | Google | Internet | 12 | BS | Interactive Media | UX, Graphics, ... | 100K
Slices derived from the submission:
Title | Region | $$
User Exp Designer | SF Bay Area | 100K
Title | Region | Industry | $$
User Exp Designer | SF Bay Area | Internet | 100K
Title | Region | Years of exp | $$
User Exp Designer | SF Bay Area | 10+ | 100K
Title | Region | Company | Years of exp | $$
User Exp Designer | SF Bay Area | Google | 10+ | 100K
Resulting cohort for (Title, Region):
Title | Region | $$
User Exp Designer | SF Bay Area | 100K
User Exp Designer | SF Bay Area | 115K
... | ... | ...
#data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
Note: Original submission stored as encrypted objects.
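The slicing step in this example can be sketched as follows. The attribute subsets mirror the four slices shown above. The field names, the `generalize_years` helper and its bucket edges are assumptions for illustration; the slide only shows that 12 years is generalized to "10+".

```python
from typing import Dict, List, Tuple

# Attribute subsets taken from the example slices above.
SLICE_DEFINITIONS: List[Tuple[str, ...]] = [
    ("title", "region"),
    ("title", "region", "industry"),
    ("title", "region", "years_of_exp"),
    ("title", "region", "company", "years_of_exp"),
]


def generalize_years(years: int) -> str:
    """Bucket years of experience (bucket edges are assumed, not from the talk)."""
    if years >= 10:
        return "10+"
    if years >= 5:
        return "5-9"
    return "0-4"


def slice_submission(submission: Dict[str, object]) -> List[Dict[str, object]]:
    """Project one submission onto each attribute subset, keeping the compensation."""
    generalized = dict(submission)
    generalized["years_of_exp"] = generalize_years(int(submission["years_of_exp"]))
    slices = []
    for attrs in SLICE_DEFINITIONS:
        entry = {name: generalized[name] for name in attrs}
        entry["compensation"] = submission["compensation"]
        slices.append(entry)
    return slices


submission = {
    "title": "User Exp Designer",
    "region": "SF Bay Area",
    "company": "Google",
    "industry": "Internet",
    "years_of_exp": 12,
    "degree": "BS",
    "skills": ["UX", "Graphics"],
    "compensation": 100_000,
}

for entry in slice_submission(submission):
    print(entry)
```

Attributes that do not define a slice (degree, skills, and so on) never appear in the sliced entries.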
System Architecture
Collection & Storage
▪Allow members to submit their compensation info
▪Extract member attributes
– E.g., canonical job title, company, region, by invoking LinkedIn
standardization services
▪Securely store member attributes & compensation data
De-identification & Grouping
▪Approach inspired by k-Anonymity [Samarati-Sweeney]
▪“Cohort” or “Slice”
– Defined by a combination of attributes
– E.g., “User experience designers in SF Bay Area”
▪ Contains aggregated compensation entries from corresponding individuals
▪ No user name, id or any attributes other than those that define the cohort
– A cohort available for offline processing only if it has at least k entries
– Apply LinkedIn standardization software (free-form attribute → canonical
version) before grouping
▪ Analogous to the generalization step in k-Anonymity
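A minimal sketch of the cohort rule described above: entries are grouped by the attributes that define the cohort, and a cohort is made available only if it has at least k entries. The function names and the sample data are hypothetical, and standardization to canonical titles is assumed to have happened already.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CohortKey = Tuple[str, str]  # (canonical title, region)


def group_into_cohorts(
    entries: Iterable[Tuple[str, str, int]]
) -> Dict[CohortKey, List[int]]:
    """Group compensation values by the attributes that define the cohort."""
    cohorts: Dict[CohortKey, List[int]] = defaultdict(list)
    for title, region, compensation in entries:
        cohorts[(title, region)].append(compensation)
    return dict(cohorts)


def releasable_cohorts(
    cohorts: Dict[CohortKey, List[int]], k: int
) -> Dict[CohortKey, List[int]]:
    """Keep only cohorts with at least k entries."""
    return {key: values for key, values in cohorts.items() if len(values) >= k}


# Hypothetical entries (illustrative values only).
entries = [
    ("User Exp Designer", "SF Bay Area", 100_000),
    ("User Exp Designer", "SF Bay Area", 115_000),
    ("User Exp Designer", "SF Bay Area", 98_000),
    ("Data Scientist", "New York", 130_000),
]

available = releasable_cohorts(group_into_cohorts(entries), k=3)
print(available)
```

The single-entry cohort is withheld, so its one contributor cannot be singled out from the released data.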
▪Slicing service
– Access member attribute info & submission identifiers (no compensation data)
– Generate slices & track # submissions for each slice
▪Preparation service
– Fetch compensation data (using submission identifiers), associate with
the slice data, copy to HDFS
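The division of labor between the two services can be sketched as below. The in-memory dictionaries stand in for the separately stored attribute and compensation data, and applying the threshold k inside the preparation step is an assumption; all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CohortKey = Tuple[str, str]  # (canonical title, region)


def slicing_service(attributes: Dict[str, CohortKey]) -> Dict[CohortKey, List[str]]:
    """Build slices from attributes and submission ids only (no compensation data)."""
    slices: Dict[CohortKey, List[str]] = defaultdict(list)
    for submission_id, cohort in attributes.items():
        slices[cohort].append(submission_id)
    return dict(slices)


def preparation_service(
    slices: Dict[CohortKey, List[str]],
    compensation: Dict[str, int],
    k: int,
) -> Dict[CohortKey, List[int]]:
    """Fetch compensation by submission id for slices with at least k submissions."""
    prepared: Dict[CohortKey, List[int]] = {}
    for cohort, submission_ids in slices.items():
        if len(submission_ids) >= k:
            # Submission ids are dropped here; sorting also hides arrival order.
            prepared[cohort] = sorted(compensation[sid] for sid in submission_ids)
    return prepared


# Hypothetical stand-ins for the two stores, keyed by submission id.
attribute_store = {
    "s1": ("User Exp Designer", "SF Bay Area"),
    "s2": ("User Exp Designer", "SF Bay Area"),
    "s3": ("User Exp Designer", "SF Bay Area"),
    "s4": ("Data Scientist", "New York"),
}
compensation_store = {"s1": 100_000, "s2": 115_000, "s3": 98_000, "s4": 130_000}

slices = slicing_service(attribute_store)
dataset_for_hdfs = preparation_service(slices, compensation_store, k=3)
print(dataset_for_hdfs)
```

In this sketch neither function needs both member attributes beyond the cohort key and compensation at the same time, which is the separation the slide describes.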
Insights & Modeling
▪Salary insight service
– Check whether the member is eligible
▪ Give-to-get model
– If yes, show the insights
▪Offline workflow
– Consume de-identified HDFS dataset
– Compute robust compensation insights
▪ Outlier detection
▪ Bayesian smoothing
– Populate the insight key-value stores
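The flow above can be sketched roughly as follows, under several assumptions: eligibility is modeled as "the member has submitted their own compensation", the key-value store is a plain dictionary, and the insights are limited to the 10th percentile, median and 90th percentile. Outlier detection and Bayesian smoothing are deliberately left out, since the slides do not describe them in enough detail to reproduce.

```python
import statistics
from typing import Dict, List, Optional, Set, Tuple

CohortKey = Tuple[str, str]  # (canonical title, region)
Insight = Dict[str, float]


def compute_insights(cohort_values: Dict[CohortKey, List[int]]) -> Dict[CohortKey, Insight]:
    """Offline step: compute one key-value entry per de-identified cohort."""
    store: Dict[CohortKey, Insight] = {}
    for key, values in cohort_values.items():
        deciles = statistics.quantiles(values, n=10)  # 9 cut points
        store[key] = {
            "p10": deciles[0],
            "median": statistics.median(values),
            "p90": deciles[-1],
        }
    return store


def get_insights(
    member_id: str,
    key: CohortKey,
    contributors: Set[str],
    store: Dict[CohortKey, Insight],
) -> Optional[Insight]:
    """Online step: give-to-get check, then a key-value lookup."""
    if member_id not in contributors:
        return None  # member has not contributed, so no insights are shown
    return store.get(key)


# Hypothetical data (illustrative values only).
key = ("User Exp Designer", "SF Bay Area")
store = compute_insights({key: [90_000, 100_000, 110_000, 120_000, 130_000]})
contributors = {"member-1"}

print(get_insights("member-1", key, contributors, store))
print(get_insights("member-2", key, contributors, store))
```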
Security Mechanisms
▪Encryption of member attributes & compensation data using different sets of keys
– Separation of processing
– Limiting access to the keys
▪Key rotation
▪No single point of failure
▪Infra security
Preventing Timestamp Join based Attacks
▪Inference attack by joining these on timestamp
– De-identified compensation data
– Page view logs (when a member accessed compensation collection web interface)
⇒ Not desirable to retain the exact timestamp
▪Perturb by adding random delay (say, up to 48 hours)
▪Modification based on k-Anonymity
– Generalization using a hierarchy of timestamps
– But, need to be incremental
⇒ Process entries within a cohort in batches of size k
▪ Generalize to a common timestamp
▪ Make additional data available only in such incremental batches
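The batching idea above can be sketched as follows. Entries of one cohort are released only in complete batches of size k, and every entry in a batch gets the same timestamp. Using the latest timestamp in the batch as the common value is an assumption made here for illustration; the function name and sample data are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[float, int]  # (submission timestamp in epoch seconds, compensation)


def release_in_batches(entries: List[Entry], k: int) -> Tuple[List[Entry], List[Entry]]:
    """Return (released, pending) entries for a single cohort."""
    ordered = sorted(entries)  # oldest submission first
    full = len(ordered) // k * k  # number of entries that fill complete batches
    released: List[Entry] = []
    for start in range(0, full, k):
        batch = ordered[start:start + k]
        common_ts = batch[-1][0]  # assumed choice: latest timestamp in the batch
        released.extend((common_ts, compensation) for _, compensation in batch)
    return released, ordered[full:]


# Hypothetical cohort with five submissions (illustrative values only).
entries = [
    (100.0, 100_000),
    (250.0, 115_000),
    (260.0, 98_000),
    (400.0, 120_000),
    (900.0, 105_000),
]

released, pending = release_in_batches(entries, k=2)
print(released)
print(pending)
```

The fifth entry stays pending until another submission arrives to complete its batch, which is the source of the delay studied in the tradeoff section.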
Privacy vs Modeling Tradeoffs
▪Our system has been deployed in production for ~1.5 years
▪Study tradeoffs between privacy guarantees (‘k’) and data
available for computing insights
– Dataset: Compensation submission history from 1.5M LinkedIn members
– Amount of data available vs. minimum threshold, k
– Effect of processing entries in batches of size, k
[Figure: Amount of data available vs. threshold, k]
[Figure: Percent of data available vs. batch size, k]
[Figure: Median delay due to batching vs. batch size, k]
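The quantities plotted in the first two figures can be computed along these lines. The cohort sizes below are invented for illustration and do not reproduce the study's results; the function names are hypothetical.

```python
from typing import List


def fraction_available_with_threshold(cohort_sizes: List[int], k: int) -> float:
    """Share of entries that sit in cohorts with at least k entries."""
    total = sum(cohort_sizes)
    kept = sum(n for n in cohort_sizes if n >= k)
    return kept / total


def fraction_available_with_batching(cohort_sizes: List[int], k: int) -> float:
    """Share of entries that fall into complete batches of size k."""
    total = sum(cohort_sizes)
    kept = sum(n // k * k for n in cohort_sizes)
    return kept / total


# Hypothetical cohort sizes (100 entries in total).
cohort_sizes = [3, 8, 12, 25, 52]

for k in (5, 10):
    print(
        k,
        fraction_available_with_threshold(cohort_sizes, k),
        fraction_available_with_batching(cohort_sizes, k),
    )
```

Batching never makes more data available than thresholding with the same k, because the leftover entries of each cohort wait for the next batch to fill.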
Summary
▪ LinkedIn Salary: a new internet application
– Privacy and Modeling Challenges
– System Design & Architecture
– Privacy vs. Modeling Tradeoffs
▪ Potential directions
– Privacy-preserving machine learning models in a practical
setting [e.g., Chaudhuri et al, Papernot et al]
Thanks & Pointers
▪ Related tech report: K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal,
Bringing salary transparency to the world: Computing robust compensation
insights via LinkedIn Salary, 2017 (arxiv.org/abs/1703.09845)
▪ Team:
Careers Engineering
– Ahsan Chudhary
– Alan Yang
– Alex Navasardyan
– Brandyn Bennett
– Hrishikesh S
– Jim Tao
– Juan Pablo Lomeli Diaz
– Lu Zheng
– Patrick Schutz
– Ricky Yan
– Stephanie Chou
– Joseph Florencio
– Santosh Kumar Kancha
– Anthony Duerr
Data Relevance Engineering
– Krishnaram Kenthapadi, Stuart Ambler, Yiqun Liu, Parul Jain, Liang
Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
Product Managers: Ryan Sandler, Keren Baruch
UED: Julie Kuang
Marketing: Phil Bunge
Business Operations: Prateek Janardhan
BA: Fiona Li
Testing: Bharath Shetty
ProdOps/VOM: Sunil Mahadeshwar
Security: Cory Scott, Tushar Dalvi, and team
linkedin.com/salary

Editor's Notes

  1. Corresponding paper: LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers (https://arxiv.org/abs/1705.06976), IEEE Symposium on Privacy-Aware Computing, 2017.
     Why LinkedIn Salary:
     ▪ Compensation is a key factor when choosing a new job opportunity
     ▪ But it is not easily available (asymmetry between job seekers and job providers)
     ▪ Goal: help job seekers explore compensation along different dimensions, make more informed career decisions, and optimize their earning potential
     ▪ Compensation data can also help improve other LinkedIn products/services such as job recommendations
     Other social benefits:
     ▪ Better understand the monetary dimensions of the economic graph
     ▪ Greater transparency / address pay inequality
     ▪ Greater efficiency in the labor marketplace (reduce asymmetry of knowledge)
     ▪ Encourage workers to learn skills needed for obtaining well-paying jobs (narrow the skills gap)
  2. Overall design incorporating a combination of techniques such as encryption, access control, de-identification, aggregation, thresholding.
  3. Example link: https://www.linkedin.com/salary/explorer?countryCode=us&regionCode=84&titleId=3114 In the publicly launched LinkedIn Salary product, users can explore compensation insights by searching for different titles and regions. For a given title and location, we present the quantiles (10th and 90th percentiles, median) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how the pay varies based on factors such as region, experience, education, company size, and industry, and which locations, industries, or companies pay the most.
  4. We started reaching out to members during early 2016. The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn users. We designed a data collection process based on a give-to-get model, as follows. First, cohorts (such as User Experience Designers in San Francisco Bay Area) with a sufficient number of LinkedIn users are selected. Within each cohort, emails are sent to a random subset of users, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding users with the compensation insights, and also reach out to the remaining users in those cohorts, promising corresponding insights immediately upon submission of their compensation data.
     Data collection process (at a high level):
     ▪ Select (title, region) cohorts with enough members
     ▪ Wave 1: Emails sent to a random subset in each cohort, requesting members to submit their salary (with promise of insights once there is enough data)
     ▪ Wave 2: Once there is enough data, get back to the responding members with insights, and also reach out to the remaining members, promising immediate insights
  5. Considering the sensitive nature of compensation data and the desire for preserving privacy of users, we designed our system such that there is protection against data breach, and any one individual's compensation data cannot be inferred by observing the outputs of the system.
     ▪ Encryption: Member attributes and compensation data encrypted separately
     ▪ De-identification: Slice data points along a limited number of attributes towards de-identification
     ▪ Aggregation: Sliced data grouped before processing, subject to minimum threshold
  6. Modeling Challenges
     Modeling on aggregated data: Due to the privacy requirements, the salary modeling system has access only to cohort level data containing aggregated compensation submissions (e.g., salaries for UX Designers in San Francisco Bay Area), limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes such as title, country, region, company, and years of experience, and contains aggregated compensation entries obtained from individuals having the same values of those attributes. Within a cohort, each individual entry consists of values for different compensation types such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without associated user name, id, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited since we have access only to the aggregated data, and cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.
     Evaluation: In contrast to several other user-facing products such as movie and job recommendations, we face unique evaluation and data quality challenges. Users themselves may not have a good perception of the true compensation range, and hence it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., BLS OES dataset), mapping such datasets to LinkedIn's taxonomy is inevitably noisy.
     Outlier Detection: As the quality of the insights depends on the quality of submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes/misunderstandings during submission, or intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when this problem was more challenging, and there may not be sufficient data across, say, related cohorts.
     Robustness and Stability: While some cohorts may each have a large sample size, a large number of cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we need to ensure that the compensation insights are robust and stable even when there is data sparsity. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. A related challenge is whether we can reliably infer the insights for cohorts with no data at all.
     Our problem can thus be stated as follows: How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we design our system taking into account the unique privacy and security challenges, while addressing the product requirements?
  7. How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products?
  8. Before proceeding further, a natural question is: why not use rigorous privacy techniques such as differential privacy? There is a rich literature in the field of privacy-preserving data mining spanning different research communities (e.g., [11], [12], [20], [21], [22], [23], [24], [25], [26], [27], [28]), as well as on the limitations of simple anonymization techniques (e.g., [29], [30], [31]). Based on the lessons learned from the privacy literature, we first attempted to make use of rigorous privacy techniques such as differential privacy [32], [33] in our problem setting. However, we soon realized that these are not applicable in our context for the following reasons: (1) the amount of noise to be added to the quantiles, histograms, and other insights would be very large (thereby depriving the compensation insights of their reliability and usefulness), since the worst case sensitivity of these functions to any one user’s compensation data could be large, and (2) the insights need to be provided on a continual basis with the arrival of new data points. Although there is theoretical work on applying differential privacy under continual observations [34], [35], we have not come across any practical implementations or applications of these techniques. We also explored approaches similar to recent work at Google [36] and Apple [37] on privacy-preserving data collection at scale, which focuses on applications such as learning statistics about how unwanted software is hijacking users’ settings in the Chrome browser and discovering the usage patterns of a large number of iOS users for improving the touch keyboard, respectively. These approaches are (or seem to be) built on the concept of randomized response [38] and typically require responses from hundreds of thousands of users for the results to be useful. In contrast, even the larger of our cohorts contain only a few thousand data points, and hence these approaches are not applicable in our setting.
P.S. Please see the paper (https://arxiv.org/abs/1705.06976) for the numbered references above.
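To make reason (1) concrete, here is a minimal sketch of how far one user's entry can move a median. The bounds `LOW` and `HIGH`, the five-entry cohort, and the privacy budget `epsilon` are assumptions chosen for illustration, not values from the LinkedIn system. The last step uses the standard Laplace mechanism, whose noise scale is the sensitivity divided by epsilon; the notes do not say which mechanism was evaluated.

```python
from statistics import median

# Hypothetical bounds on a reported compensation value (illustration only).
LOW, HIGH = 0, 1_000_000

# A cohort of five entries: three at the lower bound, two at the upper bound.
cohort = [LOW, LOW, LOW, HIGH, HIGH]

# A neighboring cohort that differs in exactly one user's entry.
neighbor = [LOW, LOW, HIGH, HIGH, HIGH]

# Changing a single entry moves the median across the whole range.
sensitivity = abs(median(neighbor) - median(cohort))

# Under the Laplace mechanism, the noise scale is sensitivity / epsilon.
epsilon = 1.0  # hypothetical privacy budget
laplace_scale = sensitivity / epsilon

print(sensitivity)    # 1000000
print(laplace_scale)  # 1000000.0
```

With noise on the order of the full compensation range, a reported median would carry almost no information, which is the usefulness concern raised above.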
  9. Note: for illustration purposes, we have hidden / abstracted several details. In particular, the member attribute fields and the compensation data are stored in encrypted form (with different sets of keys). Once we have enough entries to meet the threshold, we put the slice data in a queue so that it can be associated with the compensation data.
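The thresholding step in this note can be sketched roughly as follows. This is an illustration under stated assumptions, not the production design: the names `Slice`, `ThresholdGate`, and `min_threshold` are hypothetical, the slice has only two attributes, and encryption is left out entirely.

```python
from collections import Counter, deque
from typing import NamedTuple


class Slice(NamedTuple):
    """A cohort key built from a limited set of member attributes (hypothetical fields)."""
    title: str
    region: str


class ThresholdGate:
    """Counts submissions per slice and queues a slice once it reaches the threshold."""

    def __init__(self, min_threshold: int) -> None:
        self.min_threshold = min_threshold
        self.counts: Counter = Counter()
        self.queue: deque = deque()

    def record_submission(self, cohort: Slice) -> bool:
        """Register one submission; return True if the slice was queued by this call."""
        self.counts[cohort] += 1
        if self.counts[cohort] == self.min_threshold:
            self.queue.append(cohort)
            return True
        return False


gate = ThresholdGate(min_threshold=3)
ux_sf = Slice("UX Designer", "San Francisco Bay Area")
results = [gate.record_submission(ux_sf) for _ in range(4)]
print(results)  # [False, False, True, False]
```

The slice is queued exactly once, when the third submission arrives; earlier submissions leave nothing downstream to join against compensation data.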
  10. Our system uses a service-oriented architecture (see Figure 3), and consists of the following three key components: a collection and storage component, a de-identification and grouping component, and an insights and modeling component.
  11. Pursued this approach due to the inherent security limitations of Hadoop and HDFS
  12. Ensures that the compensation data is secure even if the Databus service is breached. Mechanism for modifying the timestamp to prevent timestamp-based inference attacks.
  13. Next: different security mechanisms
  14. Not only store the member attributes & the compensation data in encrypted form, but also use different encryption keys for each. No service has a need to simultaneously decrypt both. These are never processed at the same time or by the same service. Consequently, an attacker would need to break into both encryption systems. Separation of slicing service and preparation service. In the unlikely event of the submission service being breached, the attacker cannot decrypt the historical data. Limiting access to keys: each service only has access to the public keys needed for encryption and the private keys needed for decryption.
  15. Use just the submission history data, not the compensation data itself, for these experiments
  16. Only used cohorts with at least 3 entries for this analysis. 65% of cohorts available with k = 5 and 38% with k = 10, compared to k = 3. About one-third have 3 or 4 entries each; another one-third have between 5 and 9 entries each; the rest have 10+ entries.
  17. Effect of processing entries within a cohort in batches of size k (in addition to requiring a minimum threshold of k). Incremental privacy protection, but some data unavailable (e.g., the 6th, 7th, and 8th entries are not made available if there are 8 submissions in a cohort and k = 5). 4.5% of the submissions withheld with k = 3; 13% withheld with k = 5; 24% withheld with k = 10.
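The withheld-entry arithmetic in this note can be sketched as below. The helper names `withheld_count` and `withheld_fraction` are hypothetical, and the cohort sizes passed to `withheld_fraction` are toy values, not LinkedIn data. The sketch assumes entries are released only in full batches of size k, so the remainder after the last full batch stays withheld.

```python
from typing import Iterable


def withheld_count(num_submissions: int, k: int) -> int:
    """Entries left over after the last full batch of size k."""
    return num_submissions % k


def withheld_fraction(cohort_sizes: Iterable[int], k: int) -> float:
    """Share of all submissions withheld across cohorts of the given sizes."""
    sizes = list(cohort_sizes)
    return sum(withheld_count(n, k) for n in sizes) / sum(sizes)


# The example from the note: 8 submissions and k = 5 withhold the 6th to 8th entries.
print(withheld_count(8, 5))  # 3

# Toy cohort sizes (illustration only): 3 + 2 + 0 entries withheld out of 30.
toy_fraction = withheld_fraction([8, 12, 10], k=5)
```

A cohort with fewer than k submissions has all of its entries withheld, which matches the minimum-threshold requirement.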
  18. Effect of processing entries within a cohort in batches of size k (in addition to requiring a minimum threshold of k). Median delay within each cohort, limited to those entries that would get processed. E.g., for k = 5 and a cohort with 8 entries submitted on days 1, 2, …, 8, we ignore the last three entries, and obtain median delay = 2 days. For each k, we show the distribution across all cohorts: median over cohorts = red horizontal line; box boundaries = q1 and q3, respectively. Median delay for three-fourths of the cohorts is less than about 5 weeks for batch sizes up to 10, although it could be as high as 300 to 400 days for a few outlier cohorts (that receive relatively fewer / infrequent submissions).
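The median-delay example in this note can be reproduced with a short sketch. It assumes that an entry's delay is the number of days between its submission and the arrival of the last entry in its batch; that reading matches the stated result of 2 days, but the exact definition used in the analysis is not spelled out here. The function name `median_batch_delay` is hypothetical.

```python
from statistics import median
from typing import Iterable, Optional


def median_batch_delay(submission_days: Iterable[int], k: int) -> Optional[float]:
    """Median delay in days over entries that fall into full batches of size k."""
    days = sorted(submission_days)
    processed = len(days) - len(days) % k  # entries beyond the last full batch are ignored
    delays = []
    for start in range(0, processed, k):
        batch = days[start:start + k]
        release_day = batch[-1]  # the batch is released when its last entry arrives
        delays.extend(release_day - day for day in batch)
    return median(delays) if delays else None


# Entries on days 1..8 with k = 5: days 6-8 are ignored, delays are 4, 3, 2, 1, 0.
print(median_batch_delay(range(1, 9), k=5))  # 2
```

A cohort that never fills a single batch returns `None`, since no entry would be processed.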
  19. Applicability of provably privacy-preserving machine learning approaches: would require a redesign; build richer prediction models while preserving privacy. Outlier detection during the submission stage, using user profile & behavioral features.
  20. Link to this paper: LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers (https://arxiv.org/abs/1705.06976)