
Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data

There is a growing feeling that privacy concerns dampen innovation in machine learning and AI applied to personal and/or sensitive data. After all, ML and AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously degrade model quality and predictive power. While this is technically true for some privacy-safe modeling techniques, it is not true in general. The root cause of the problem is two-fold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-preserving machine learning and AI easy. This talk challenges the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers introduce a wide range of techniques, from simple hashing to advanced embeddings, for high-accuracy, privacy-safe model development. Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization, and ensuring downstream privacy in multi-party collaborations. In addition, we dig into embeddings as a unique deep learning-based approach to privacy-preserving modeling over unstructured data. Special attention is given to Spark-based production environments.

Speakers: Sim Simeonov, Slater Victoroff



  1. Great Models with Great Privacy: Optimizing ML & AI Over Sensitive Data Sim Simeonov, CTO, Swoop sim@swoop.com / @simeons Slater Victoroff, CTO, Indico slater@indico.io / @sl8rv #UnifiedAnalytics #SparkAISummit
  2. Privacy-preserving ML/AI to improve patient outcomes and pharmaceutical operations, e.g., we improve the diagnosis rate of rare diseases
  3. Intelligent Process Automation for unstructured content using ML/AI with strong data protection guarantees, e.g., we automate the processing of loan documents using a combination of NLP and computer vision
  4. Regulation affects ML & AI • General Data Protection Regulation (GDPR) – Already in effect in the EU • California Consumer Privacy Act (CCPA) – Comes into effect Jan 1, 2020 • Many federal efforts under way – Information Transparency and Personal Data Control Act (DelBene, D-WA) – Consumer Data Protection Act (Wyden, D-OR) – Data Care Act (Schatz, D-HI)
  5. Accuracy vs. privacy is a false dichotomy (if you are willing to invest in privacy infrastructure)
  6. Privacy-preserving computation frontiers • Stochastic – Differential privacy (DP) • Encryption-based – Fully homomorphic encryption (FHE) • Protocol-based – Secure multi-party computation (SMC)
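As a toy illustration of the stochastic approach (differential privacy), here is a Laplace-mechanism sketch in plain Python. The `laplace_noise` helper, the patient records, and the epsilon values are illustrative assumptions, not material from the talk:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float = 1.0) -> float:
    """Differentially private count: a counting query has sensitivity 1,
    so adding Laplace(1/epsilon) noise yields epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative records; smaller epsilon means more noise, more privacy.
patients = [{"age": 34}, {"age": 61}, {"age": 58}, {"age": 70}]
noisy = dp_count(patients, lambda p: p["age"] > 50, epsilon=0.5)
```

Note the trade-off the talk highlights: as epsilon shrinks, the released count becomes noisier, which is exactly the accuracy cost that sanitizing inputs tries to avoid.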
  7. When privacy-preserving algorithms are immature, sanitize the data the algorithms are trained on
  8. Session roadmap • The rest of this session – Sanitizing a single dataset using Spark • After the break – Sanitizing joinable datasets the smart way – Embeddings for unstructured data (deep learning!)
  9. What does it mean to sanitize data for privacy?
  10. Identifiability spectrum: Personally-Identified Information (PII) – Device-Identified Information (DII) – De-Identified Information
  11. Direct identifiability • Personal information (PI) • Sanitize with pseudonymous identifiers – Secure one-way mapping of PI to opaque IDs Example: Sim Simeonov; Male; July 7, 1977; One Swoop Way, Cambridge, MA 02140
  12. Secure pseudonymous ID generation Sim Simeonov; Male; July 7, 1977; One Swoop Way, Cambridge, MA 02140 (also: Simeon Simeonov; M; 1977-07-07; One Swoop Way, Suite 305, Cambridge, MA 02140; ...) → Sim|Simeonov|M|1977-07-07|02140 // canonical representation → 8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx) → gPGIoVw … nNpij1LveZRtKeWU= // master encryption (AES-xxx) → Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption → 6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption
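The pipeline on slide 12 can be sketched in plain Python. The canonicalization rules and keys below are illustrative assumptions, and the AES encryption steps are stood in for by keyed HMACs so the sketch stays dependency-free; a real deployment would use an actual AES implementation with proper key management:

```python
import hashlib
import hmac

def canonicalize(first, last, gender, dob, zip5):
    """Canonical representation: normalize case and format before hashing
    so variant entries of the same person map to the same string."""
    return "|".join([first.strip().title(), last.strip().title(),
                     gender.strip().upper()[:1], dob, zip5])

def pseudonymous_id(record: str, master_key: bytes) -> str:
    """Destructive SHA-256 hash, then a keyed transform standing in
    for the master encryption step."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return hmac.new(master_key, digest, hashlib.sha256).hexdigest()

def partner_id(master_id: str, partner_key: bytes) -> str:
    """Per-partner re-keying so two partners cannot join their IDs."""
    return hmac.new(partner_key, master_id.encode("utf-8"), hashlib.sha256).hexdigest()

rec = canonicalize("Sim", "Simeonov", "M", "1977-07-07", "02140")
mid = pseudonymous_id(rec, b"master-secret")
a_id = partner_id(mid, b"partner-a-secret")
b_id = partner_id(mid, b"partner-b-secret")
```

The key property is that the mapping is one-way (the hash is destructive) yet stable, so the same person always receives the same opaque ID while the PI itself is unrecoverable.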
  13. Dealing with dirty data Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …) Tune fuzzification to use cases & desired FP/FN rates
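The `S551` code on slide 13 is a Soundex code for "Simeonov". One way to implement that style of name fuzzification is sketched below; the talk does not prescribe a specific algorithm, so American Soundex here is an illustrative choice:

```python
# Map consonants to American Soundex digits; vowels (and y) are dropped.
_CODES = {}
for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")]:
    for ch in letters:
        _CODES[ch] = digit

def soundex(name: str) -> str:
    """American Soundex: first letter plus up to three digits."""
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = _CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w do not separate equal codes
            continue
        code = _CODES.get(ch, "")      # vowels map to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

def fuzzify(first, last, gender, dob, zip5):
    """One possible fuzzified entry: initial + Soundex + generalized date."""
    return "|".join([first[0].upper(), soundex(last), gender, dob[:7], zip5])

fuzzify("Sim", "Simeonov", "M", "1977-07-07", "02140")
# -> "S|S551|M|1977-07|02140"
```

Matching on the fuzzified form tolerates typos and partial entry at the cost of more false positives, which is the FP/FN tuning knob the slide mentions.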
  14. Building pseudonymous IDs with Spark
  15. Indirect identifiability via quasi-identifiers
  16. Sanitizing quasi-identifiers • k-anonymity – Generalize or suppress quasi-identifiers – Any record is “similar” to at least k-1 other records • (k, ε)-anonymity – adds noise
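A minimal check for the k-anonymity property described on slide 16 (the field names and generalized values below are illustrative):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k: int) -> bool:
    """True if every combination of quasi-identifier values is shared by
    at least k records, so any record hides among at least k-1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]
is_k_anonymous(rows, ["age", "zip"], k=2)   # -> False: the 40-49 group has 1 record
```

The third row would have to be further generalized (e.g., widening the age bucket) or suppressed to reach k=2, which is exactly the information loss the following slides quantify.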
  17. Sanitizing quasi-identifiers in Spark • Optimal k-anonymity is an NP-hard problem – Mondrian algorithm: greedy O(n log n) approximation https://github.com/eubr-bigsea/k-anonymity-mondrian • Active research – Locality-sensitive hashing (LSH) improvements – Risk-based approaches (LBS algorithm) – Academic Spark implementations (TDS, MDSBA)
  18. Sanitizing joinable datasets • The industry standard: centralized sanitization • Sanitize the data as if it were joined together • Big increase in quasi-identifiers
  19. The curse of dimensionality for sanitization: “We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity.” – Aggarwal, C. (IBM T. J. Watson Research Center), On k-Anonymity and the Curse of Dimensionality, 2005
  20. Curse of dimensionality example • Titanic passenger data • k-anonymize for different values of k by – Age – Age & gender • Compute normalized certainty penalty (NCP) – Higher values mean more information loss
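The NCP metric on slide 20 can be computed roughly as follows for a numeric attribute; this is the common textbook definition, and the exact weighting varies across papers, so treat the sketch as illustrative:

```python
def ncp_numeric(groups, domain_min: float, domain_max: float) -> float:
    """Normalized certainty penalty for one numeric attribute.

    Each group of records is generalized to the interval [min, max] of its
    values; the penalty for every record in the group is that interval's
    width divided by the full domain width. Returns the record-weighted
    average, so higher values mean more information loss.
    """
    total, n = 0.0, 0
    for values in groups:
        width = (max(values) - min(values)) / (domain_max - domain_min)
        total += width * len(values)
        n += len(values)
    return total / n

# Two 2-anonymous groups generalized on age (illustrative domain 0-80):
groups = [[22, 28], [35, 61]]
ncp_numeric(groups, 0, 80)   # -> 0.2, i.e. ((6/80)*2 + (26/80)*2) / 4
```

Adding a second quasi-identifier (e.g., gender) forces wider age intervals to keep every group at size k, which is why NCP climbs sharply on the next slide.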
  21. (chart) k-anonymizing Titanic passenger survivability: normalized certainty penalty vs. k (2-10), anonymizing by age alone vs. by gender & age
  22. Centralized sanitization describes an alternate reality
  23. Centralized sanitization increases risk: “We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality.” – Fredrikson, M. et al. (UW Madison and Marshfield Clinic Research Foundation), Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, 2014
  24. High-accuracy ML/AI requires federated sanitization
  25. Federated sanitization (Swoop’s prAIvacy™) • Isolated execution environments (EEs) • Workflows execute across EEs • Task-specific sanitization firewalls • Sanitization is often lossless Example: model condition X without mixing health data with other data
  26. There is no standard anonymization framework for unstructured data: suppress or structure or ???
  27. The Problem With Text: “John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.”
  28. The Problem With Text (same example) Problem: PII
  29. The Problem With Text (same example) Problem: PII Solution(s) • Remove common names? • Tell doctors to stop using names in their notes? • Look up patient information in notes and intentionally remove it
  30. The Problem With Text (same example) Problem: Quasi-Identifiers
  31. The Problem With Text (same example) Problem: Quasi-Identifiers Solution(s) • Remove all locations? • Remove all gendered pronouns? • Remove all numbers?
  32. The Problem With Text (same example) Problem: Weak Identifiers
  33. The Problem With Text (same example) Problem: Weak Identifiers Solution(s) • Remove all text
  34. The Real Problem With Text (same example) Problem: Predictive/Clinical Value
  35. The Real Problem With Text (same example) Problem: Predictive/Clinical Value Accuracy Matters
  36. The Real Problem With Text (same example) Problem: Predictive/Clinical Value Accuracy Matters Solution(s) • Embeddings
  37. What’s Changed? • i2b2 dataset for de-identification (PII & quasi-identifiers) – Standard NLP (part-of-speech tagging, rules-based, keywords) • 0.91 F1 after months of writing rules [1, 2] • Drops to 0.79 F1 on new data (unusable) [1, 2] – Modern NLP (deep learning, embeddings) • 0.98 F1 without any rule-writing (roughly a 10x error reduction) [1, 3] • >0.99 F1 on new data [1, 3] References: [1] Max Friedrich. Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text. November 20, 2018. https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf [2] Amber Stubbs, Michele Filannino, and Özlem Uzuner. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics, 75(S):S4–S18, November 2017. ISSN 1532-0464. https://doi.org/10.1016/j.jbi.2017.06.011 [3] Franck Dernoncourt, Ji Young Lee, Özlem Uzuner, and Peter Szolovits. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association, 24(3):596–606, 2017. https://doi.org/10.1093/jamia/ocw156
  38. What is an Embedding? An embedding method (e.g. Word2Vec) maps items from text space (e.g. English) to points in an embedding space (e.g. R300), such as [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]
  40. (figure: words lie on a manifold in the embedding space)
  41. What is an Embedding? • Text Space – Full Algorithmic Utility – Limited to No Guarantees on Privacy – Sparse, Inherently Brittle • Embedding Space – Very High Algorithmic Utility – Strong Privacy Guarantees – Dense, Generically More Generalizable
  42. Why are embeddings helpful? “This representation would still require humans to annotate PHI (as this is the training data for the task) but the pseudonymization step would be performed by the transformation to the representation. A tool trained on the representation could easily be made publicly available because its parameters cannot contain any protected data, as it is never trained on raw text.” – Max Friedrich, Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text, November 20, 2018. https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf
  43. How do Embeddings Work? • Meaning is “encoded” into the embedding space (e.g. king − man + woman ≈ queen, a “royalty” direction) • Individual dimensions are not human interpretable • Embedding method learns by examining large corpora of generic language • Goal is accurate language representation as a proxy for downstream performance
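The king/queen analogy on slide 43 can be demonstrated with cosine similarity over toy vectors. The 2-dimensional vectors below are fabricated for illustration (real embeddings have hundreds of dimensions and are learned, not hand-assigned):

```python
import math

# Toy 2-D embeddings: dimension 0 ~ "femaleness", dimension 1 ~ "royalty".
vectors = {
    "man":    [0.1, 0.0],
    "woman":  [1.0, 0.1],
    "king":   [0.1, 1.0],
    "queen":  [1.0, 1.1],
    "tennis": [0.2, -0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c (excluding inputs)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

analogy("king", "man", "woman")   # -> "queen"
```

Note that while such directions exist in the space, the individual dimensions themselves carry no human-readable meaning, which is part of what makes embeddings privacy-friendly.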
  44. “Word” Embeddings • Examples: Word2vec, GloVe, fastText • In practice: a lookup table from token to vector, e.g. “great” → [0.1, 0.3, …] • Training: predict a word from its context (CBOW: “The quick brown fox _____ over the lazy dog”) or the context from a word (skip-gram)
  47. Do They Really Preserve Algorithmic Value? (chart: accuracy vs. number of data points, word2vec vs. tf-idf on the IMDB sentiment analysis benchmark) • Embeddings generally outperform raw text at low data volumes • Leveraging large, generic text corpora improves generalizability • This is 6 year old tech. Embeddings have improved drastically. Text has not. Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which logistic regression hyperparameters are optimized. Generated using Enso.
  48. Do They Really Preserve Algorithmic Value? (chart: accuracy vs. number of data points, GloVe vs. tf-idf on the Rotten Tomatoes sentiment analysis benchmark) • Embeddings generally outperform raw text at low data volumes • Leveraging large, generic text corpora improves generalizability • This is 4 year old tech. Embeddings have improved drastically. Text has not. Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which logistic regression hyperparameters are optimized. Generated using Enso.
  49. Text Embeddings • Examples: Doc2vec, ELMo, ULMFiT • In practice: often built on top of pre-trained word embeddings • Training: word vectors for a sequence (“The quick brown fox jumps over the lazy …”) feed a model trained on a language objective (predict “dog”) or a supervised objective
  52. Text Embeddings, CNN-style: the sequence of word vectors is combined by a convolutional network into a prediction (example: https://arxiv.org/pdf/1408.5882.pdf)
  53. Text Embeddings, RNN-style: word vectors are consumed one at a time by a recurrent network that carries a memory state to produce outputs (example: https://arxiv.org/pdf/1802.05365.pdf)
  54. Additional Notes • “Word” Embeddings – May not correspond directly to words – Many excellent public implementations – Vulnerable to rainbow-table attacks (do not store) • Text Embeddings – Often built on top of word embeddings – k-anonymity can be directly assessed – Naively implemented as mean of word embeddings
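The "naive" text embedding mentioned on slide 54, the mean of the word vectors, is only a few lines of Python. The tiny vocabulary below is fabricated for illustration:

```python
# Illustrative word vectors; real tables come from word2vec/GloVe/fastText.
word_vectors = {
    "great":   [0.9, 0.1, 0.4],
    "models":  [0.2, 0.8, 0.3],
    "with":    [0.1, 0.1, 0.1],
    "privacy": [0.3, 0.2, 0.9],
}

def embed_text(text: str, vectors: dict) -> list:
    """Naive text embedding: average the word vectors of known tokens."""
    rows = [vectors[t] for t in text.lower().split() if t in vectors]
    if not rows:
        raise ValueError("no known tokens in text")
    return [sum(col) / len(rows) for col in zip(*rows)]

embed_text("great models with privacy", word_vectors)
# -> [0.375, 0.3, 0.425]
```

Averaging discards word order, which is why the Doc2vec/ELMo/ULMFiT-style models above generally outperform it, but even this baseline yields a dense, fixed-size representation that never stores raw text.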
  55. Embedding Options • Public – Ready-to-use downloadable word vectors – Trained on public datasets (e.g. word2vec trained on Common Crawl) • Homegrown – Embedding technique trained in-house – Typically a straightforward algorithm on a novel data corpus • Third Party – Sourced from a vendor responsible for updating data sources and algorithms
  56. Embedding Options (Public) (table of Low/Medium/High ratings across Ease of Use, Algorithmic Advantage, Intrinsic Privacy, Efficacy, Transparency & Control) Notes: • Public embeddings effectively mean public rainbow tables • Licensing is often ambiguous or incompatible with commercial use
  57. Embedding Options (Homegrown) (ratings across the same dimensions) Notes: • Deceptively easy to start, extremely difficult to use effectively in production • While this can theoretically be a good approach, for most organizations it is a bad idea
  58. Embedding Options (Third Party) (ratings across the same dimensions) Notes: • Data leakage can result in handing away competitive edge • Efficacy is highly variable; ensure robust benchmarks before working with any third party • A privacy-controlled environment makes involving a third party contractually difficult
  59. State of Embeddings in Spark • Training yourself – Reasonable word2vec implementations – Proper evaluation requires two nested CV loops + a private holdout at a minimum • Basic NLP functionality – John Snow Labs Spark NLP (https://nlp.johnsnowlabs.com/) • Pretrained embeddings – Compute offline in Python (gensim, spaCy, indico)
  60. Summary • Privacy-preserving computation is important • De-identification is easy with Spark • Beware centralized sanitization • Embeddings are great for privacy
  61. Interested in challenging data engineering, ML & AI on very big data? We’d love to hear from you. sim@swoop.com / slater@indico.io Privacy matters. Thank you for caring.
