Preserving privacy of users is a key requirement of web-scale analytics and reporting applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. We focus on the problem of computing robust, reliable analytics in a privacy-preserving manner, while satisfying product requirements. We present PriPeARL, a framework for privacy-preserving analytics and reporting, inspired by differential privacy. We describe the overall design and architecture, and the key modeling components, focusing on the unique challenges associated with privacy, coverage, utility, and consistency. We perform an experimental study in the context of ads analytics and reporting at LinkedIn, thereby demonstrating the tradeoffs between privacy and utility needs, and the applicability of privacy-preserving mechanisms to real-world data. We also highlight the lessons learned from the production deployment of our system at LinkedIn.
Presented at ACM CIKM 2018. Link to our paper: https://arxiv.org/pdf/1809.07754
PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
Krishnaram Kenthapadi, Thanh Tran
Data @ LinkedIn
Analytics Products at LinkedIn
Profile View Analytics
Ad Campaign Analytics
All showing demographics
of members engaging with the product
Product Requirements: Utility and Privacy
• Insights into the audience engaging with
the product (e.g., profile, article, or ad)
→ Desirable for the aggregate statistics
to be available and accurate.
• Different aspects of data consistency:
- Repeated queries
- Over time
- Total vs. Demographic breakdowns
- Hierarchy (e.g., time, entity)
• Member actions could be considered
sensitive information (e.g., click on an
article or an ad).
→ An individual’s actions must not be
inferable from the results of analytics.
• Assume malicious use cases, e.g.,
attacker can set up ad campaigns to
infer the behavior of a certain member.
Application: LinkedIn Ads Analytics
Compute robust, reliable analytics in a privacy-preserving
manner, while addressing the product desiderata such as utility,
coverage, and consistency.
Senior directors in the US who studied at Cornell
Matches ~16k LinkedIn members
→ over minimum targeting threshold
E.g., company = X
Matches exactly one person
→ can determine whether the person
clicks on the ad or not
Enforcing minimum reporting threshold
Attacker could create fake profiles
E.g., if threshold is 10, create 9 fake
profiles that all click.
E.g., rounding: report counts in increments of 10
Still amenable to attacks
E.g., using incremental counts over time
to infer individuals’ actions
Need rigorous techniques to preserve member privacy, not
revealing exact aggregate counts
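To see why ad-hoc fixes fail, a toy sketch (hypothetical numbers, not LinkedIn data): if fresh, independent noise were drawn on every repeated query, an attacker could simply average the answers to recover the true count. This is the averaging attack that motivates the deterministic pseudo-random noise used by PriPeARL.

```python
import math
import random

random.seed(7)  # fixed seed so the demo is reproducible

TRUE_COUNT = 42  # hypothetical exact count we want to protect

def fresh_noisy_answer(epsilon: float = 1.0) -> float:
    """Answer with independent Laplace(0, 1/epsilon) noise each time (the flawed scheme)."""
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return TRUE_COUNT + noise

# The attacker repeats the same query many times and averages the noisy answers.
answers = [fresh_noisy_answer() for _ in range(20000)]
estimate = sum(answers) / len(answers)
# By the law of large numbers, the average converges to the hidden true count.
```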
Differential Privacy: Definition
● ε-Differential Privacy: For neighboring databases D and D′ (differing by one record),
the distribution of the curator’s outputs on the two databases is nearly the same.
● Parameter ε (ε > 0) quantifies information leakage
○ Smaller ε, more private
Dwork, McSherry, Nissim, Smith [TCC 2006]
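In the standard formulation of Dwork et al., a randomized mechanism M is ε-differentially private if, for all neighboring databases D, D′ and all output sets S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
\qquad \text{for all } S \subseteq \mathrm{Range}(\mathcal{M})
```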
Differential Privacy: Random Noise Addition
● Achieving differential privacy via random noise addition.
● Common approach: noise drawn from the Laplace distribution.
○ Let s be the L1 sensitivity of the query function f:
s = max_{D, D′} ||f(D) − f(D′)||₁, where D and D′ differ by one record,
○ and ε the privacy parameter.
○ Then the scale parameter of the Laplace distribution is s/ε.
Dwork, McSherry, Nissim, Smith [TCC 2006]
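As a sketch (not the production code), the Laplace mechanism for a count query can be implemented with inverse-CDF sampling; `noisy_count` and its parameter names are illustrative:

```python
import math
import random

def laplace_noise(scale: float, u: float) -> float:
    """Inverse-CDF sample from Laplace(0, scale), given uniform u in (0, 1)."""
    x = u - 0.5
    return -scale * math.copysign(1.0, x) * math.log(1.0 - 2.0 * abs(x))

def noisy_count(true_count: int, sensitivity: float, epsilon: float) -> float:
    """Add Laplace(0, s/epsilon) noise; for a count query, L1 sensitivity s = 1."""
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, random.random())
```

Note the scale s/ε matches the slide above: a smaller ε (more privacy) means noise with larger variance.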
● This query form also applies to other analytics applications
Ad Analytics Canonical Queries
SELECT COUNT(*) FROM table(statType, entity)
WHERE timestamp ≥ startTime AND timestamp ≤ endTime
AND dAttr = dVal
E.g., clicks on a given ad
E.g., Title = “Senior Director”
● Application admits a predetermined query form.
● Preserving privacy by adding Laplace noise
○ Protect privacy at the event level
PriPeARL: A Framework for Privacy-Preserving Analytics
Pseudo-random noise generation, inspired by differential privacy
● Entity id (creative/campaign/account)
● Demographic dimension
● Stat type (impressions, clicks)
● Time range
● Fixed secret seed
● Normalize to a uniform value in (0, 1)
● Fixed ε
To satisfy consistency
● Pseudo-random noise → same query has same result over time, avoiding averaging attacks
● For non-canonical queries (e.g., time ranges, aggregate multiple entities)
○ Use the hierarchy and partition into canonical queries
○ Compute noise for each canonical query and sum up the noisy counts
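A minimal sketch of the pseudo-random noise idea (the hashing details and names here are assumptions for illustration, not the exact production scheme): hash the canonical query parameters together with a fixed secret seed, normalize the hash to a uniform value in (0, 1), and apply the inverse Laplace CDF. The same query then always receives the same noise.

```python
import hashlib
import math

SECRET_SEED = "fixed-secret-seed"  # assumption: held server-side, never revealed
EPSILON = 1.0                      # fixed privacy parameter
SENSITIVITY = 1.0                  # L1 sensitivity of a count query

def pseudo_random_noise(entity_id: str, dimension: str, stat_type: str,
                        time_range: str) -> float:
    """Deterministic Laplace-distributed noise keyed on the canonical query parameters."""
    key = "|".join([SECRET_SEED, entity_id, dimension, stat_type, time_range])
    digest = hashlib.sha256(key.encode()).digest()
    # Normalize the first 8 bytes of the hash to a uniform value in (0, 1).
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2**64
    # Inverse CDF of Laplace(0, s/epsilon).
    x = u - 0.5
    return -(SENSITIVITY / EPSILON) * math.copysign(1.0, x) * math.log(1.0 - 2.0 * abs(x))

def noisy_canonical_count(true_count: int, *query_params: str) -> float:
    """Noisy answer for one canonical query; non-canonical queries sum several of these."""
    return true_count + pseudo_random_noise(*query_params)
```

Because the noise is a deterministic function of the query, repeating a query yields an identical answer, so averaging repeated answers reveals nothing new.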
Implemented and integrated into Ads Analytics product.
Can be used for general analytics products.
Performance Evaluation: Setup
● Experiments using LinkedIn ad analytics data
○ Consider distribution of impression and click queries
across (account, ad campaign) and demographic
○ Tradeoff between privacy and utility
○ Effect of varying minimum threshold (non-negative)
○ Top-n queries
Performance Evaluation: Results
Privacy and Utility Tradeoff
● For ε = 1, average absolute and signed errors
are small for both impression and click queries.
● Variance is also small, ~95% of queries have
error of at most 2.
● Common use case in LinkedIn applications.
● Jaccard distance as a function of ε and n.
● (This shows the worst case as queries with
return sets ≤ n and error=0 were omitted.)
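Jaccard distance between the true and noisy top-n result sets is a standard set-overlap metric; a small sketch with hypothetical segments (not actual evaluation data):

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0 means identical sets, 1 means disjoint sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical top-3 demographic segments ranked by true vs. noisy click counts.
true_top = {"Senior Director", "Software Engineer", "Product Manager"}
noisy_top = {"Senior Director", "Software Engineer", "Data Scientist"}
```

Here the intersection has 2 elements and the union 4, so the distance is 0.5; in the evaluation, distances shrink as ε or n grows.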
● Lessons from privacy breaches → need “Privacy by Design”
● Consider business requirements and usability
○ Various consistency desiderata to ensure results are useful and insightful
● Scaling across analytics applications
○ Abstract away application specifics, build libraries, and optimize for
▸ AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran
▸ Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian
▸ Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke
▹ Additional Acknowledgements
▸ Deepak Agarwal, Igor Perisic, Arun Swami, Ya Xu, Yang Zhou
▹ Framework to compute robust, privacy-preserving analytics
▸ Addressing challenges such as preserving member privacy, product
coverage, utility, and data consistency
▸ Utility maximization problem given constraints on the ‘privacy loss budget’ per user
⬩ E.g., noise with larger variance to impressions but less noise to clicks (or
⬩ E.g., more noise to broader time range sub-queries and less noise to granular
time range sub-queries
▹ Tech Report: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018
What’s Next: Privacy for ML / Data Applications
▹ Hard open questions
▸ Can we simultaneously develop highly personalized models
and ensure that the models do not encode private information?
▸ How do we guarantee member privacy over time without
exhausting the “privacy loss budget”?
▸ How do we enable privacy-preserving mechanisms for data