Online professional social networks such as LinkedIn have enhanced the ability of job seekers to discover and assess career opportunities, and the ability of job providers to discover and assess potential candidates. For most job seekers, salary (or broadly compensation) is a crucial consideration in choosing a new job. At the same time, job seekers face challenges in learning the compensation associated with different jobs, given the sensitive nature of compensation data and the dearth of reliable sources containing compensation data. Towards the goal of helping the world’s professionals optimize their earning potential through salary transparency, we present LinkedIn Salary, a system for collecting compensation information from LinkedIn members and providing compensation insights to job seekers. We present the overall design and architecture, and describe the key components needed for the secure collection, de-identification, and processing of compensation data, focusing on the unique challenges associated with privacy and security. We perform an experimental study with more than one year of compensation submission history data collected from over 1.5 million LinkedIn members, thereby demonstrating the tradeoffs between privacy and modeling needs. We also highlight the lessons learned from the production deployment of this system at LinkedIn.
Presented at IEEE Symposium on Privacy-Aware Computing (IEEE PAC), 2017.
Corresponding paper: LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE Symposium on Privacy-Aware Computing, 2017 (available at https://arxiv.org/abs/1705.06976).
LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers
1. LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers
Krishnaram Kenthapadi
Staff Software Engineer and Applied Researcher, LinkedIn
(Joint work with Ahsan Chudhary, Stuart Ambler)
2. Outline
▪ LinkedIn Salary Overview
▪ Challenges: Privacy, Modeling
▪ System Design & Architecture
▪ Privacy vs. Modeling Tradeoffs
5. Current Reach (July 2017)
▪ A few million responses out of several million members targeted
– Targeted via emails since early 2016
▪ Countries: US, CA, UK, DE
▪ Insights available for a large fraction of US monthly active users
6. Data Privacy Challenges
▪ Minimize the risk of inferring any one individual’s compensation data
▪ Protection against data breach
– No single point of failure
Achieved by a combination of techniques: encryption, access control, de-identification, aggregation, thresholding
7. Modeling Challenges
▪ Modeling on aggregated data
▪ Evaluation
▪ Outlier detection
▪ Robustness and stability
K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal, Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary, CIKM 2017 (arxiv.org/abs/1703.09845)
8. Problem Statement
▪How do we design the LinkedIn Salary system taking into account the unique privacy and security challenges, while addressing the product requirements?
9. Differential Privacy? [Dwork et al, 2006]
▪Rich privacy literature (Adam-Worthmann, Samarati-Sweeney, Agrawal-Srikant, …, Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al)
– Limitations of anonymization techniques (Backstrom et al, …, Narayanan et al)
▪Worst case sensitivity of quantiles to any one user’s compensation data is large
– Large noise to be added, depriving the insights of reliability/usefulness
▪Need compensation insights on a continual basis
– Theoretical work on applying differential privacy under continual observations
▪ No practical implementations / applications
– Randomized response based approaches (Google’s RAPPOR; Apple) not applicable
10. De-identification Example
Original submission:
Title | Region | Company | Industry | Years of exp | Degree | FoS | Skills | $$
User Exp Designer | SF Bay Area | Google | Internet | 12 | BS | Interactive Media | UX, Graphics, ... | 100K
Slices derived from the submission:
Title | Region | $$
User Exp Designer | SF Bay Area | 100K
Title | Region | Industry | $$
User Exp Designer | SF Bay Area | Internet | 100K
Title | Region | Years of exp | $$
User Exp Designer | SF Bay Area | 10+ | 100K
Title | Region | Company | Years of exp | $$
User Exp Designer | SF Bay Area | Google | 10+ | 100K
#data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
Title | Region | $$
User Exp Designer | SF Bay Area | 100K
User Exp Designer | SF Bay Area | 115K
... | ... | ...
Note: Original submission stored as encrypted objects.
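The example above can be sketched in a few lines of Python. This is only an illustration: the field names, the `make_slices` helper, and the experience buckets other than "10+" are assumptions, not part of the deployed system. For simplicity the sketch carries the salary value along with each slice, whereas the deck describes the slicing step as working with submission identifiers only.

```python
# The four attribute combinations shown on the slide.
SLICE_DEFINITIONS = [
    ("title", "region"),
    ("title", "region", "industry"),
    ("title", "region", "years_of_exp"),
    ("title", "region", "company", "years_of_exp"),
]


def bucket_years(years):
    """Generalize exact years of experience into a coarse bucket.

    Only the "10+" bucket appears on the slide; the other edges are assumed.
    """
    if years >= 10:
        return "10+"
    if years >= 5:
        return "5-9"
    return "0-4"


def make_slices(submission):
    """Project one submission onto each slice definition.

    Attributes outside a slice definition (degree, skills, ...) are dropped.
    """
    generalized = dict(submission, years_of_exp=bucket_years(submission["years_of_exp"]))
    slices = []
    for attrs in SLICE_DEFINITIONS:
        key = tuple((name, generalized[name]) for name in attrs)
        slices.append((key, submission["base_salary"]))
    return slices


submission = {
    "title": "User Exp Designer",
    "region": "SF Bay Area",
    "company": "Google",
    "industry": "Internet",
    "years_of_exp": 12,
    "degree": "BS",
    "field_of_study": "Interactive Media",
    "skills": ["UX", "Graphics"],
    "base_salary": 100_000,
}
slices = make_slices(submission)
```

Each resulting slice keeps only the attributes that define it, so the exact value 12 never leaves the slicing step; only the bucket "10+" does.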
13. Collection & Storage
▪Allow members to submit their compensation info
▪Extract member attributes
– E.g., canonical job title, company, region, by invoking LinkedIn standardization services
▪Securely store member attributes & compensation data
15. De-identification & Grouping
▪Approach inspired by k-Anonymity [Samarati-Sweeney]
▪“Cohort” or “Slice”
– Defined by a combination of attributes
– E.g., “User experience designers in SF Bay Area”
▪ Contains aggregated compensation entries from corresponding individuals
▪ No user name, id or any attributes other than those that define the cohort
– A cohort available for offline processing only if it has at least k entries
– Apply LinkedIn standardization software (free-form attribute to canonical version) before grouping
▪ Analogous to the generalization step in k-Anonymity
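A minimal sketch of the thresholding rule above, assuming entries arrive as (cohort key, compensation) pairs. The function name `releasable_cohorts` and the sample values are made up for illustration.

```python
from collections import defaultdict


def releasable_cohorts(entries, k):
    """Group entries by cohort and keep only cohorts with at least k entries.

    `entries` is an iterable of (cohort_key, compensation) pairs. The output
    carries no user name, id, or attribute beyond the cohort key itself.
    """
    cohorts = defaultdict(list)
    for cohort_key, compensation in entries:
        cohorts[cohort_key].append(compensation)
    return {key: values for key, values in cohorts.items() if len(values) >= k}


# Illustrative values only.
entries = [
    (("User Exp Designer", "SF Bay Area"), 100_000),
    (("User Exp Designer", "SF Bay Area"), 115_000),
    (("User Exp Designer", "SF Bay Area"), 108_000),
    (("Data Scientist", "SF Bay Area"), 130_000),
]
released = releasable_cohorts(entries, k=3)
```

With k = 3 only the first cohort is released; the single-entry cohort stays unavailable for offline processing.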
16. De-identification & Grouping
▪Slicing service
– Access member attribute info & submission identifiers (no compensation data)
– Generate slices & track # submissions for each slice
▪Preparation service
– Fetch compensation data (using submission identifiers), associate with the slice data, copy to HDFS
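The separation between the two services might look roughly like the sketch below. The class names, the in-memory queue, and the dictionary standing in for the encrypted compensation store are all assumptions; the sketch also assumes that submission identifiers are handed over in batches of k, which the deck describes separately as part of its timestamp protection.

```python
from collections import defaultdict, deque


class SlicingService:
    """Sees member attributes and submission ids, never compensation values."""

    def __init__(self, k, queue):
        self.k = k
        self.queue = queue
        self.pending = defaultdict(list)  # slice key -> submission ids not yet handed over

    def add(self, slice_key, submission_id):
        ids = self.pending[slice_key]
        ids.append(submission_id)
        if len(ids) >= self.k:
            # Enough entries: hand the batch of ids to the preparation service.
            self.queue.append((slice_key, list(ids)))
            ids.clear()


class PreparationService:
    """Looks up compensation by submission id and emits de-identified rows."""

    def __init__(self, compensation_store, queue):
        self.store = compensation_store
        self.queue = queue

    def drain(self):
        rows = []
        while self.queue:
            slice_key, ids = self.queue.popleft()
            # Output rows carry the slice key and the value, but no submission id.
            rows.extend((slice_key, self.store[i]) for i in ids)
        return rows


queue = deque()
store = {"s1": 100_000, "s2": 115_000, "s3": 90_000}  # stand-in for the compensation store
slicer = SlicingService(k=2, queue=queue)
key = ("User Exp Designer", "SF Bay Area")
for submission_id in ("s1", "s2", "s3"):
    slicer.add(key, submission_id)
rows = PreparationService(store, queue).drain()
```

Here the third submission stays pending in the slicing service until a second full batch forms, and neither class ever holds both the member attributes and the compensation values before the threshold is met.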
18. Insights & Modeling
▪Salary insight service
– Check whether the member is eligible
▪ Give-to-get model
– If yes, show the insights
▪Offline workflow
– Consume de-identified HDFS dataset
– Compute robust compensation insights
▪ Outlier detection
▪ Bayesian smoothing
– Populate the insight key-value stores
22. Preventing Timestamp Join based Attacks
▪Inference attack by joining these on timestamp
– De-identified compensation data
– Page view logs (when a member accessed compensation collection web interface)
– Not desirable to retain the exact timestamp
▪Perturb by adding random delay (say, up to 48 hours)
▪Modification based on k-Anonymity
– Generalization using a hierarchy of timestamps
– But, need to be incremental
– Process entries within a cohort in batches of size k
▪ Generalize to a common timestamp
▪ Make additional data available only in such incremental batches
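The batch-based generalization can be sketched as follows. Using the arrival time of the k-th entry as the common timestamp is an assumption (the slide only says entries are generalized to a common timestamp), and `release_in_batches` is a hypothetical helper name.

```python
def release_in_batches(timestamps, k):
    """Split one cohort's submission timestamps into full batches of size k.

    Returns (released, withheld): `released` pairs each original timestamp
    with the generalized timestamp of its batch; `withheld` holds the trailing
    entries that do not yet fill a batch.
    """
    ordered = sorted(timestamps)
    full = len(ordered) - len(ordered) % k
    released = []
    for start in range(0, full, k):
        batch = ordered[start:start + k]
        common = batch[-1]  # assumed: the batch becomes available when its k-th entry arrives
        released.extend((original, common) for original in batch)
    return released, ordered[full:]


# Eight submissions on days 1..8 with k = 5.
released, withheld = release_in_batches([1, 2, 3, 4, 5, 6, 7, 8], k=5)
```

After generalization all five released entries share one timestamp, so joining them with page view logs on time can no longer single out one member, while entries 6 to 8 wait for the next batch.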
23. Privacy vs Modeling Tradeoffs
▪Our system deployed in production for ~1.5 years
▪Study tradeoffs between privacy guarantees (‘k’) and data
available for computing insights
– Dataset: Compensation submission history from 1.5M LinkedIn members
– Amount of data available vs. minimum threshold, k
– Effect of processing entries in batches of size k
27. Summary
▪ LinkedIn Salary: a new internet application
– Privacy and Modeling Challenges
– System Design & Architecture
– Privacy vs. Modeling Tradeoffs
▪ Potential directions
– Privacy-preserving machine learning models in a practical
setting [e.g., Chaudhuri et al, Papernot et al]
28. Thanks & Pointers
▪ Related tech report: K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal,
Bringing salary transparency to the world: Computing robust compensation
insights via LinkedIn Salary, 2017 (arxiv.org/abs/1703.09845)
▪ Team:
Careers Engineering
Ahsan Chudhary
Alan Yang
Alex Navasardyan
Brandyn Bennett
Hrishikesh S
Jim Tao
Juan Pablo Lomeli Diaz
Lu Zheng
Patrick Schutz
Ricky Yan
Stephanie Chou
Joseph Florencio
Santosh Kumar Kancha
Anthony Duerr
Data Relevance Engineering
Krishnaram Kenthapadi, Stuart Ambler, Yiqun Liu, Parul Jain, Liang
Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
Product Managers: Ryan Sandler, Keren Baruch
UED: Julie Kuang
Marketing: Phil Bunge
Business Operations: Prateek Janardhan
BA: Fiona Li
Testing: Bharath Shetty
ProdOps/VOM: Sunil Mahadeshwar
Security: Cory Scott, Tushar Dalvi, and team
linkedin.com/salary
Editor's Notes
Why LinkedIn Salary:
Compensation a key factor when choosing a new job opportunity
But, not easily available (asymmetry between job seekers and job providers)
Goal: help job seekers explore compensation along different dimensions, make more informed career decisions / optimize their earning potential
Compensation data can also help improve other LinkedIn product/services such as job recommendations
Other social benefits:
Better understand the monetary dimensions of the economic graph
Greater transparency / address pay inequality
Greater efficiency in the labor marketplace (reduce asymmetry of knowledge)
Encourage workers to learn skills needed for obtaining well paying jobs (narrow the skills gap)
Overall design incorporating a combination of techniques such as encryption, access control, de-identification, aggregation, thresholding.
Example link: https://www.linkedin.com/salary/explorer?countryCode=us&regionCode=84&titleId=3114
In the publicly launched LinkedIn Salary product, users can explore compensation insights by searching for different titles and regions. For a given title and location, we present the quantiles (10th and 90th percentiles, median) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how the pay varies based on factors such as region, experience, education, company size, and industry, and which locations, industries, or companies pay the most.
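As a rough illustration of the quantiles mentioned above, the snippet below computes the 10th percentile, median, and 90th percentile for one cohort using Python's standard library. The salary values are made up, and the computation is deliberately naive: it does not include the outlier detection or Bayesian smoothing that the production workflow applies.

```python
from statistics import median, quantiles

# Made-up base salaries for one (title, region) cohort, for illustration only.
base_salaries = [92_000, 100_000, 104_000, 108_000, 115_000, 121_000, 130_000]

# quantiles(..., n=10) returns the nine cut points at 10%, 20%, ..., 90%.
deciles = quantiles(base_salaries, n=10, method="inclusive")
p10, p50, p90 = deciles[0], deciles[4], deciles[8]
```

The middle cut point coincides with the ordinary median, and the pair (p10, p90) gives the kind of range shown alongside it in the product.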
We started reaching out to members during early 2016.
The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn users. We designed a give-to-get model based data collection process as follows. First, cohorts (such as User Experience Designers in San Francisco Bay Area) with a sufficient number of LinkedIn users are selected. Within each cohort, emails are sent to a random subset of users, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding users with the compensation insights, and also reach out to the remaining users in those cohorts, promising corresponding insights immediately upon submission of their compensation data.
Data collection process (at a high level):
Select (title, region) cohorts with enough members
Wave 1: Emails sent to a random subset in each cohort, requesting members to submit their salary (with promise of insights once there is enough data)
Wave 2: Once there is enough data, get back to the responding members with insights, and also reach out to the remaining members, promising immediate insights
Considering the sensitive nature of compensation data and the desire for preserving privacy of users, we designed our system such that there is protection against data breach, and any one individual's compensation data cannot be inferred by observing the outputs of the system.
Encryption:
Member attributes and compensation data encrypted separately
De-identification:
Slice data points along limited number of attributes towards de-identification
Aggregation:
Sliced data grouped before processing, subject to minimum threshold
Modeling Challenges
Modeling on aggregated data: Due to the privacy requirements, the salary modeling system has access only to cohort level data containing aggregated compensation submissions (e.g., salaries for UX Designers in San Francisco Bay Area), limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes such as title, country, region, company, and years of experience, and contains aggregated compensation entries obtained from individuals having the same values of those attributes. Within a cohort, each individual entry consists of values for different compensation types such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without associated user name, id, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited since we have access only to the aggregated data, and cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.
Evaluation: In contrast to several other user-facing products such as movie and job recommendations, we face unique evaluation and data quality challenges. Users themselves may not have a good perception of the true compensation range, and hence it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., BLS OES dataset), mapping such datasets to LinkedIn's taxonomy is inevitably noisy.
Outlier Detection: As the quality of the insights depends on the quality of submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes/misunderstandings during submission, or intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when this problem was more challenging, and there may not be sufficient data across say, related cohorts.
Robustness and Stability: While some cohorts may each have a large sample size, a large number of cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we need to ensure that the compensation insights are robust and stable even when there is data sparsity. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. A related challenge is whether we can reliably infer the insights for cohorts with no data at all.
Our problem can thus be stated as follows: How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we design our system taking into account the unique privacy and security challenges, while addressing the product requirements?
Before proceeding further, a natural question is: why not use rigorous privacy techniques such as differential privacy?
There is rich literature in the field of privacy-preserving data mining spanning different research communities (e.g., [11], [12], [20], [21], [22], [23], [24], [25], [26], [27], [28]), as well as on the limitations of simple anonymization techniques (e.g., [29], [30], [31]). Based on the lessons learned from the privacy literature, we first attempted to make use of rigorous privacy techniques such as differential privacy [32], [33] in our problem setting. However, we soon realized that these are not applicable in our context for the following reasons: (1) the amount of noise to be added to the quantiles, histograms, and other insights would be very large (thereby depriving the compensation insights of their reliability and usefulness), since the worst case sensitivity of these functions to any one user’s compensation data could be large, and (2) the insights need to be provided on a continual basis with the arrival of new data points. Although there is theoretical work on applying differential privacy under continual observations [34], [35], we have not come across any practical implementations or applications of these techniques. We also explored approaches similar to recent work at Google [36] and Apple [37] on privacy-preserving data collection at scale that focuses on applications such as learning statistics about how unwanted software is hijacking users’ settings in Chrome browser and discovering the usage patterns of a large number of iOS users for improving the touch keyboard respectively. These approaches are (or seem to be) built on the concept of randomized response [38] and require response from typically hundreds of thousands of users for the results to be useful. In contrast, even the larger of our cohorts contain only a few thousand data points, and hence these approaches are not applicable in our setting.
P.S. Please see the paper (https://arxiv.org/abs/1705.06976) for the numbered references above.
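To make reason (1) concrete, here is a small worked example of why the worst-case sensitivity of a median is large. It uses the standard Laplace mechanism, whose noise scale is sensitivity divided by epsilon. The salary cap and the privacy budget are assumed values chosen for illustration, and the two cohorts are deliberately contrived worst cases.

```python
from statistics import median


def laplace_noise_scale(sensitivity, epsilon):
    """Scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    return sensitivity / epsilon


SALARY_CAP = 500_000  # assumed upper bound on a reported salary
EPSILON = 1.0         # assumed privacy budget

# Two neighboring cohorts that differ in a single member's entry.
cohort = [0, 0, 0, SALARY_CAP, SALARY_CAP]
neighbor = [0, 0, SALARY_CAP, SALARY_CAP, SALARY_CAP]

median_shift = abs(median(neighbor) - median(cohort))
noise_scale = laplace_noise_scale(median_shift, EPSILON)
```

Changing one entry moves the median across the whole allowed range, so noise calibrated to the worst case would be on the order of the salary cap itself, which is far too large for a usable compensation insight.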
Note: for illustration purposes, we have hidden / abstracted several details. In particular, the member attribute fields and the compensation data are stored in encrypted form (with different sets of keys).
Once we have enough entries to meet the threshold, we put slice data in a queue so that it can be associated with compensation data.
Our system uses a service oriented architecture (see Figure 3), and consists of the following three key components:
a collection and storage component
a de-identification and grouping component, and
an insights and modeling component.
Pursued this approach due to the inherent security limitations of Hadoop and HDFS
Ensures that the compensation data is secure even if the Databus service is breached
Mechanism for modifying the timestamp to prevent timestamp based inference attacks
Next: different security mechanisms
Not only store the member attributes & the compensation data in encrypted form, but also use different encryption keys for each
No service has a need to simultaneously decrypt both
These are never processed at the same time or by the same service. An attacker would need to break into both encryption systems
Separation of slicing service and preparation service
In the unlikely event of the submission service being breached, the attacker cannot decrypt the historical data
Limiting access to keys: each service has access only to the public keys it needs for encryption and the private keys it needs for decryption
Use just the submission history data, not the compensation data itself for these experiments
Only used cohorts with at least 3 entries for this analysis
Relative to the cohorts available with k = 3, 65% remain available with k = 5 and 38% with k = 10
About one third have 3 or 4 entries each; another one-third have between 5 and 9 entries each; rest have 10+ entries
Effect of processing entries within a cohort in batches of size k (in addition to requiring a minimum threshold of k)
Incremental privacy protection, but some data unavailable (e.g., 6th, 7th, 8th entries not made available if there are 8 submissions in a cohort and k = 5)
4.5% of the submissions withheld with k = 3
13% withheld with k = 5
24% withheld with k = 10
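The withheld counts follow from simple modular arithmetic, sketched below. The helper names are hypothetical, and the cohort sizes in the example are made up; the sketch does not attempt to reproduce the percentages reported above, which come from the actual submission history. How cohorts smaller than k were counted in that analysis is not stated, so treating all of their entries as withheld is an assumption.

```python
def withheld_entries(cohort_size, k):
    """Entries left over after releasing full batches of size k.

    For cohort_size < k this returns cohort_size, since nothing is released.
    """
    return cohort_size % k


def withheld_fraction(cohort_sizes, k):
    """Fraction of all submissions that are withheld across the given cohorts."""
    total = sum(cohort_sizes)
    withheld = sum(withheld_entries(n, k) for n in cohort_sizes)
    return withheld / total if total else 0.0


# The example from the note: 8 submissions with k = 5 leaves entries 6, 7, 8 withheld.
example = withheld_entries(8, 5)
```

For instance, two cohorts of sizes 8 and 12 with k = 5 withhold 3 + 2 = 5 of 20 submissions.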
Effect of processing entries within a cohort in batches of size k (in addition to requiring a minimum threshold of k)
Median delay within each cohort, limited to those entries that would get processed
E.g., for k = 5 and a cohort with 8 entries submitted on days 1, 2, …, 8, we ignore the last three entries, and obtain median delay = 2 days
For each k, we show the distribution across all cohorts: Median over cohorts == red horizontal line; Box boundaries == q1 and q3 respectively
Median delay for three-fourths of the cohorts is less than about 5 weeks for batch sizes up to 10, although it could be as high as 300 to 400 days for a few outlier cohorts (that receive relatively fewer / infrequent submissions)
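The per-cohort median delay in the example above can be reproduced with a short sketch. It assumes that a batch becomes available on the day its k-th entry arrives, so an entry's delay is the gap between its own submission day and that day; `median_delay` is a hypothetical helper name.

```python
from statistics import median


def median_delay(days, k):
    """Median delay, in days, over the entries of one cohort that get processed.

    Entries are released in full batches of size k; trailing entries that do
    not fill a batch are ignored. Returns None if no batch is complete.
    """
    ordered = sorted(days)
    full = len(ordered) - len(ordered) % k
    delays = []
    for start in range(0, full, k):
        batch = ordered[start:start + k]
        release_day = batch[-1]  # assumed: released when the k-th entry arrives
        delays.extend(release_day - day for day in batch)
    return median(delays) if delays else None


# k = 5, entries submitted on days 1..8: delays are 4, 3, 2, 1, 0 for the first
# five entries, and the last three entries are ignored.
cohort_delay = median_delay(range(1, 9), k=5)
```

This gives a median delay of 2 days for the cohort, matching the example in the notes.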
Applicability of provably privacy-preserving machine learning approaches
Would require a redesign; build richer prediction models while preserving privacy
Outlier detection during submission stage
Using user profile & behavioral features