PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn

Preserving privacy of users is a key requirement of web-scale analytics and reporting applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. We focus on the problem of computing robust, reliable analytics in a privacy-preserving manner, while satisfying product requirements. We present PriPeARL, a framework for privacy-preserving analytics and reporting, inspired by differential privacy. We describe the overall design and architecture, and the key modeling components, focusing on the unique challenges associated with privacy, coverage, utility, and consistency. We perform an experimental study in the context of ads analytics and reporting at LinkedIn, thereby demonstrating the tradeoffs between privacy and utility needs, and the applicability of privacy-preserving mechanisms to real-world data. We also highlight the lessons learned from the production deployment of our system at LinkedIn.

Presented at ACM CIKM 2018. Link to our paper: https://arxiv.org/pdf/1809.07754


  1. PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
     CIKM 2018
     Krishnaram Kenthapadi, Thanh Tran
     Data @ LinkedIn
  2. Analytics Products at LinkedIn
     • Profile View Analytics
     • Content Analytics
     • Ad Campaign Analytics
     All showing demographics of members engaging with the product
  3. Product Requirements: Utility and Privacy
     Utility
     • Insights into the audience engaging with the product (e.g., profile, article, or ad)
       → Desirable for the aggregate statistics to be available and accurate.
     • Different aspects of data consistency:
       - Repeated queries
       - Over time
       - Total vs. demographic breakdowns
       - Hierarchy (e.g., time, entity)
     Privacy
     • Member actions could be considered sensitive information (e.g., click on an article or an ad).
       → Individual’s action cannot be inferred from the results of analytics.
     • Assume malicious use cases, e.g., attacker can set up ad campaigns to infer the behavior of a certain member.
  4. LMS Application: LinkedIn Ads Analytics
     Objective: Compute robust, reliable analytics in a privacy-preserving manner, while addressing the product desiderata such as utility, coverage, and consistency.
     [Diagram labels: Advertiser, Ad, Ad Targeting, LI Ad Serving, Ad Analytics]
  5. Possible Attacks
     • Targeting: Senior directors in US, who studied at Cornell
       Matches ~16k LinkedIn members → over minimum targeting threshold
     • Demographic breakdown: e.g., company = X
       Matches exactly one person → can determine whether the person clicks on the ad or not
     • Enforcing a minimum reporting threshold
       Attacker could create fake profiles, e.g., if the threshold is 10, create 9 fake profiles that all click.
     • Rounding mechanism, e.g., report in increments of 10
       Still amenable to attacks, e.g., using incremental counts over time to infer individuals’ actions
     → Need rigorous techniques to preserve member privacy, not revealing exact aggregate counts
  6. Differential Privacy: Definition
     ● ε-Differential Privacy: For neighboring databases D and D’ (differing by one record), the distributions of the curator’s outputs on the two databases are nearly the same.
     ● Parameter ε (ε > 0) quantifies information leakage
       ○ Smaller ε, more private
     Dwork, McSherry, Nissim, Smith [TCC 2006]
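
The phrase "nearly the same" on the slide above is usually made precise as follows. This is the standard textbook statement of ε-differential privacy, not text taken from the slides; the symbols M (the randomized mechanism, i.e. the curator) and S (a set of possible outputs) are notation introduced here.

```latex
% M is \varepsilon-differentially private if, for all neighboring
% databases D and D' and every set of outputs S \subseteq \mathrm{Range}(M):
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

Because the roles of D and D’ can be swapped, the two output probabilities differ by a factor of at most e^ε in either direction, which is why a smaller ε means more privacy.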
  7. Differential Privacy: Random Noise Addition
     ● Achieving differential privacy via random noise addition.
     ● Common approach: noise drawn from the Laplace distribution.
       ○ Let s be the L1 sensitivity of the query function f:
         s = max_{D, D’} ||f(D) - f(D’)||_1, where D and D’ differ by one record,
       ○ and ε the privacy parameter.
       ○ Then the scale parameter of the Laplace distribution is s/ε.
     Dwork, McSherry, Nissim, Smith [TCC 2006]
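
The Laplace mechanism described on the slide above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: the function names `laplace_noise` and `noisy_count` are hypothetical, the sampler uses inverse-CDF sampling from the standard library, and the true count of 100 is made up.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale) by inverting its CDF."""
    u = rng.random() - 0.5          # uniform in [-0.5, 0.5)
    while u == -0.5:                # avoid log(0) at the boundary
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def noisy_count(true_count: float, sensitivity: float, epsilon: float,
                rng: random.Random) -> float:
    """Laplace mechanism: add noise with scale s / epsilon to the true answer."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Illustrative run: a count query has L1 sensitivity 1 when one record
# changes the count by at most 1. The seed only makes the demo repeatable.
rng = random.Random(0)
samples = [noisy_count(100, sensitivity=1.0, epsilon=1.0, rng=rng)
           for _ in range(20000)]
mean_abs_error = sum(abs(x - 100) for x in samples) / len(samples)
```

For Laplace noise the expected absolute error equals the scale s/ε, so with s = 1 and ε = 1 the value of `mean_abs_error` should be close to 1; halving ε doubles it.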
  8. Ad Analytics Canonical Queries
     ● Application admits a predetermined query form:
         SELECT COUNT(*) FROM table(stateType, entity)
         WHERE timestamp ≥ startTime AND timestamp ≤ endTime
         AND dAttr = dVal
       ○ table(stateType, entity): e.g., clicks on a given ad
       ○ dAttr = dVal: e.g., Title = “Senior Director”
     ● This query form also applies for other analytics applications
     ● Preserving privacy by adding Laplace noise
       ○ Protect privacy at the event level
  9. PriPeARL: A Framework for Privacy-Preserving Analytics
     Pseudo-random noise generation, inspired by differential privacy
     ● Inputs:
       ○ Entity id (creative/campaign/campaign group/account)
       ○ Demographic dimension
       ○ Stat type (impressions, clicks)
       ○ Time range
       ○ Fixed secret seed
     ● Inputs → uniformly random fraction
       ○ Cryptographic hash
       ○ Normalize to (0,1)
     ● Uniformly random fraction → random noise
       ○ Laplace noise
       ○ Fixed ε
     ● True count + random noise → reported count
     To satisfy consistency requirements
     ● Pseudo-random noise → the same query has the same result over time, which avoids the averaging attack.
     ● For non-canonical queries (e.g., time ranges, aggregate multiple entities)
       ○ Use the hierarchy and partition into canonical queries
       ○ Compute noise for each canonical query and sum up the noisy counts
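
The pipeline on the slide above (hash the query parameters with a secret, normalize to (0,1), map to Laplace noise, add to the true count) can be sketched as follows. This is a hedged illustration, not the deployed implementation: HMAC-SHA256 as the cryptographic hash, the separator used to join parameters, a sensitivity of 1, and every function name and sample value here are assumptions made for the example.

```python
import hashlib
import hmac
import math
from typing import Mapping


def uniform_fraction(secret: bytes, *query_parts: str) -> float:
    """Map the query parameters deterministically to a fraction in (0, 1)."""
    message = "\x1f".join(query_parts).encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big") >> 12   # keep 52 bits: 0 .. 2**52 - 1
    return (n + 0.5) / 2**52                      # strictly between 0 and 1


def laplace_from_fraction(p: float, scale: float) -> float:
    """Inverse CDF of Laplace(0, scale) evaluated at p in (0, 1)."""
    u = p - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def noisy_canonical_count(true_count: int, secret: bytes, entity_id: str,
                          dimension: str, stat_type: str, time_range: str,
                          epsilon: float = 1.0) -> float:
    """Reported count for one canonical query (assumes sensitivity 1)."""
    p = uniform_fraction(secret, entity_id, dimension, stat_type, time_range)
    return true_count + laplace_from_fraction(p, 1.0 / epsilon)


def noisy_range_count(daily_counts: Mapping[str, int], secret: bytes,
                      entity_id: str, dimension: str, stat_type: str,
                      epsilon: float = 1.0) -> float:
    """Non-canonical query: split into per-day canonical queries and sum."""
    return sum(
        noisy_canonical_count(count, secret, entity_id, dimension,
                              stat_type, day, epsilon)
        for day, count in daily_counts.items()
    )


# Hypothetical inputs for illustration only.
SECRET = b"example-secret-seed"
first = noisy_canonical_count(42, SECRET, "campaign:123",
                              "title=Senior Director", "clicks", "2018-10-01")
again = noisy_canonical_count(42, SECRET, "campaign:123",
                              "title=Senior Director", "clicks", "2018-10-01")
other_day = noisy_canonical_count(42, SECRET, "campaign:123",
                                  "title=Senior Director", "clicks", "2018-10-02")
two_days = noisy_range_count({"2018-10-01": 42, "2018-10-02": 42}, SECRET,
                             "campaign:123", "title=Senior Director", "clicks")
```

Because the noise is a deterministic function of the query and the secret, `first` and `again` are identical, so repeating a query gives an attacker nothing to average. The two-day total is the sum of the per-day noisy counts, which keeps range queries consistent with their canonical parts.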
  10. System Architecture
      Implemented and integrated into Ads Analytics product.
      Can be used for general analytics product.
  11. Performance Evaluation: Setup
      ● Experiments using LinkedIn ad analytics data
        ○ Consider distribution of impression and click queries across (account, ad campaign) and demographic breakdowns.
      ● Examine
        ○ Tradeoff between privacy and utility
        ○ Effect of varying minimum threshold (non-negative)
        ○ Top-n queries
  12. Performance Evaluation: Results
      Privacy and Utility Tradeoff
      ● For ε = 1, average absolute and signed errors are small for both queries.
      ● Variance is also small, ~95% of queries have error of at most 2.
      Top-N Queries
      ● Common use case in LinkedIn applications.
      ● Jaccard distance as a function of ε and n.
      ● (This shows the worst case as queries with return sets ≤ n and error=0 were omitted.)
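
The top-n comparison mentioned on the slide above can be illustrated with a small sketch: take the top-n keys by true count and by noisy count, then compute the Jaccard distance between the two sets. The helper names, the tie-breaking rule, and the sample counts are assumptions for illustration and are not taken from the experiments.

```python
from typing import Mapping, Set


def top_n(counts: Mapping[str, float], n: int) -> Set[str]:
    """Keys with the n largest counts (ties broken alphabetically)."""
    ranked = sorted(counts, key=lambda key: (-counts[key], key))
    return set(ranked[:n])


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up counts per demographic value, before and after noise.
true_counts = {"a": 10, "b": 8, "c": 5, "d": 1}
noisy_counts = {"a": 9.0, "b": 4.0, "c": 6.0, "d": 7.0}

distance = jaccard_distance(top_n(true_counts, 2), top_n(noisy_counts, 2))
```

Here the true top-2 set is {a, b} and the noisy top-2 set is {a, d}, so the two sets share one of three distinct keys and the distance is 1 - 1/3. A distance of 0 means the noise did not change which items are reported in the top n.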
  13. Lessons Learned
      ● Lessons from privacy breaches → need “Privacy by Design”
      ● Consider business requirements and usability
        ○ Various consistency desiderata to ensure results useful and insightful
      ● Scaling across analytics applications
        ○ Abstract away application specifics, build libraries, and optimize for performance
  14. Acknowledgements
      ▹ Team:
        ▸ AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran
        ▸ Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe
        ▸ Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke
      ▹ Additional Acknowledgements
        ▸ Deepak Agarwal, Igor Perisic, Arun Swami, Ya Xu, Yang Zhou
  15. Summary
      ▹ Framework to compute robust, privacy-preserving analytics
        ▸ Addressing challenges such as preserving member privacy, product coverage, utility, and data consistency
      ▹ Future
        ▸ Utility maximization problem given constraints on the ‘privacy loss budget’ per user
          ⬩ E.g., noise with larger variance to impressions but less noise to clicks (or conversions)
          ⬩ E.g., more noise to broader time range sub-queries and less noise to granular time range sub-queries
      ▹ Tech Report: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018 (https://arxiv.org/pdf/1809.07754)
  16. What’s Next: Privacy for ML / Data Applications
      ▹ Hard open questions
        ▸ Can we simultaneously develop highly personalized models and ensure that the models do not encode private information of members?
        ▸ How do we guarantee member privacy over time without exhausting the “privacy loss budget”?
        ▸ How do we enable privacy-preserving mechanisms for data marketplaces?
      ▹ Thanks!
  17. Appendix
  18. Algorithm for Computing Noisy Analytics
  19. Performance Evaluation: Results
      Varying minimum thresholds
