
High accuracy ML & AI over sensitive data

Covers research frontiers in privacy-preserving computation and provides practical strategies for training machine learning & AI models over sensitive data based on the lessons learned from Swoop's work with leading pharmaceutical and automotive companies. Example code in Apache Spark.

Published in: Data & Analytics

  1. High-accuracy ML & AI over sensitive data
     Simeon Simeonov, Swoop
     @simeons / sim at swoop dot com
  2. Omni-channel marketing for your ideal population, supported by privacy-preserving ML/AI. For example, we improve health outcomes by increasing the diagnosis rate of rare diseases through doctor/patient education.
  3. Swoop & IPM.ai data for 300+M people
     • Anonymized patient data
     • Online activity
     • Imprecise location data
     • Demographics, psychographics, purchase behavior, …
     Privacy by design: HIPAA-compliant prAIvacy™ platform. Trusted by the largest pharma companies. GDPR compliant.
  4. Privacy-preserving computation frontiers
     • Stochastic
       – Differential privacy
     • Encryption-based
       – Fully homomorphic encryption
     • Protocol-based
       – Secure multi-party computation (SMC)
  5. When privacy-preserving algorithms are immature, sanitize the data the algorithms are trained on
  6. Privacy concerns stem from identifiability
     • Direct (via personally-identifiable information)
     • Indirect (via quasi-identifiers)
     Example: Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140
  7. Addressing identifiability in a single dataset
     • Direct
       – Generate secure pseudonymous identifiers
       – Often uses clean room to process PII
     • Indirect
       – Sanitize quasi-identifiers to desired anonymity trade-offs
       – Control data enhancement to maintain anonymity
     anonymity == indistinguishability
  8. Sanitizing quasi-identifiers
     • Deterministic
       – Generalize or suppress quasi-identifiers
       – k-anonymity + derivatives
         • any given record maps onto at least k-1 other records
     • Stochastic
       – Add noise to data
       – (k, ℇ)-anonymity
     • Domain-specific
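The k-anonymity condition on the slide above (every record is indistinguishable from at least k-1 others on its quasi-identifiers) can be checked by grouping records on their quasi-identifier values and looking at the smallest group. The following is a minimal plain-Scala sketch; the `QuasiId` schema and the helper names are hypothetical and chosen only to mirror the gender / year of birth / 3-digit zip columns used later in the deck.

```scala
object KAnonymity {
  // Hypothetical record shape: only the quasi-identifier columns.
  final case class QuasiId(gender: String, yob: Int, zip3: String)

  /** Size of the smallest group of identical quasi-identifier tuples (0 if empty). */
  def minClassSize(records: Seq[QuasiId]): Int =
    if (records.isEmpty) 0
    else records.groupBy(identity).values.map(_.size).min

  /** True when every record shares its quasi-identifiers with at least k-1 others. */
  def isKAnonymous(records: Seq[QuasiId], k: Int): Boolean =
    records.nonEmpty && minClassSize(records) >= k
}
```

A dataset that fails the check needs further generalization or suppression of the columns that make the small groups distinguishable.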
  9. Addressing identifiability across datasets
     • Centralized approach
       – Join all data + sanitize the whole
       – Big increase in dimensionality
     • Federated approach
       – Keep data separate + sanitize operations across data
       – Smallest possible increase in dimensionality
  10. "We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity."
      On k-Anonymity and the Curse of Dimensionality, 2005. Aggarwal, C. @ IBM T. J. Watson Research Center
      Centralized sanitization hurts ML/AI accuracy
  11. "We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality."
      Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, 2014. Fredrikson, M. et al. @ UW Madison and Marshfield Clinic Research Foundation
      Centralized sanitization increases risk
  12. k-anonymizing Titanic passenger survivability
      [Chart: Normalized Certainty Penalty (NCP), 0% to 40%, versus k from 2 to 10, for two quasi-identifier sets: age, and gender & age]
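The slide does not define NCP. One common formulation, assumed here rather than taken from the deck, measures how much of an attribute's domain a generalized value covers. For a numeric attribute $A$ with domain $[\min_A, \max_A]$ and a record whose value was generalized to the interval $[l, u]$, and for a categorical attribute whose value was generalized to a hierarchy node $v$:

```latex
\mathrm{NCP}_A^{\text{num}} = \frac{u - l}{\max_A - \min_A},
\qquad
\mathrm{NCP}_A^{\text{cat}} =
\begin{cases}
0 & \text{if } v \text{ is a leaf},\\[4pt]
\dfrac{|\mathrm{leaves}(v)|}{|A|} & \text{otherwise.}
\end{cases}
```

As a hypothetical illustration, generalizing an age to $[20, 40]$ over a domain of $[0, 80]$ gives $(40 - 20) / (80 - 0) = 25\%$. A dataset-level figure is typically an average of these penalties over records and quasi-identifier attributes.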
  13. Federated sanitization: Swoop’s prAIvacy™
      • Secure, isolated data pools
      • Automated sanitization
      • Min dimensionality growth
      • Deterministic + stochastic
      • Optimal + often lossless
      Model condition X score on other data
  14. Putting it all into practice (using Spark)
      • Pre-process data
      • Generate secure pseudonymous identifiers
      • Sanitize quasi-identifiers
  15. Dirty quasi-identifiers increase distinguishability: clean data before sanitization to prevent increased sanitization loss
  16. No anonymization framework for unstructured data: suppress or structure
  17. Word embedding for text anonymization
      • Text ➞ high-dimensionality vector
        – Capture semantics: “Texas” + “Milwaukee” - “Wisconsin” ≃ “Dallas”
        – ML/AI-friendly representation
        – word2vec, doc2vec, GloVe, …
      • Anonymizing embeddings
        – Train secret embeddings model
        – Add noise to vectors
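The "add noise to vectors" step above can be illustrated with a small plain-Scala sketch. It perturbs each component of an embedding with zero-mean Gaussian noise. The function name, the use of Gaussian noise, and the `sigma` parameter are assumptions for illustration; the deck does not say which noise distribution or scale is used, and choosing `sigma` is a privacy/accuracy trade-off this sketch does not address.

```scala
import scala.util.Random

object NoisyEmbeddings {
  /**
   * Returns a copy of `v` with independent Gaussian noise (mean 0, standard
   * deviation `sigma`) added to every component. Passing in the random
   * generator keeps the function reproducible under a fixed seed.
   */
  def addGaussianNoise(v: Vector[Double], sigma: Double, rng: Random): Vector[Double] =
    v.map(x => x + sigma * rng.nextGaussian())
}
```

For example, `NoisyEmbeddings.addGaussianNoise(embedding, 0.05, new Random(42))` yields a perturbed vector of the same length.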
  18. Secure pseudonymous ID generation
      Source records:
        Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140
        Sim Simeonov; M; 1977-07-07 One Swoop Way, Suite 305, Cambridge, MA 02140
        ...
      Sim|Simeonov|M|1977-07-07|02140         // consistent serialization
      8daed4fa67a07d7a5 … 6f574021            // secure destructive hashing (SHA-xxx)
      gPGIoVw … wnNpij1LveZRtKeWU=            // master encryption (AES-xxx)
      Vw50jZjh6BCWUzSVu … mfUFtyGZ3q          // partner A encryption
      6ykWEv7A2lisz8KUi … VT2ZddaOeML         // partner B encryption
  19. Multiple IDs for dirty data
      Sim|Simeonov|M|1977-07-07|02140   // full entry when data is clean
      S|S551|M|1977-07-07|02140         // fuzzify names to handle limited entry & typos
      Sim|Simeonov|M|1977-07|02140      // also may reduce dob/geo accuracy
      Tune fuzzification to use cases & desired FP/FN rates
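To make the fuzzified serialization above concrete, here is a plain-Scala sketch that produces a full and a fuzzy serialization for one record. The `soundex` function is a hand-written American Soundex used as a stand-in for Spark's built-in `soundex` column function; it is an assumption that the two agree on every input. Names are upper-cased, as in the Spark rules later in the deck.

```scala
object FuzzyIds {
  final case class PII(firstName: String, lastName: String, gender: String, dob: String, zip: String)

  // Soundex digit for a letter; '0' means "no code" (vowels, H, W, Y).
  private def code(c: Char): Char = c match {
    case 'B' | 'F' | 'P' | 'V' => '1'
    case 'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => '2'
    case 'D' | 'T' => '3'
    case 'L' => '4'
    case 'M' | 'N' => '5'
    case 'R' => '6'
    case _ => '0'
  }

  /** American Soundex: first letter plus up to three digits, zero-padded. */
  def soundex(name: String): String = {
    val letters = name.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) ""
    else {
      val sb = new StringBuilder
      sb.append(letters.head)
      var last = code(letters.head)
      letters.tail.foreach { c =>
        val d = code(c)
        if (d != '0') {
          // Skip a digit that repeats the previous coded letter.
          if (d != last && sb.length < 4) sb.append(d)
          last = d
        } else if (c != 'H' && c != 'W') {
          last = '0' // vowels break a run of equal codes; H and W do not
        }
      }
      sb.toString.padTo(4, '0')
    }
  }

  /** Rule 1: all PII fields. */
  def fullId(p: PII): String =
    Seq(p.firstName.toUpperCase, p.lastName.toUpperCase, p.gender, p.dob, p.zip).mkString("|")

  /** Rule 2: first initial plus Soundex of the last name. */
  def fuzzyId(p: PII): String =
    Seq(p.firstName.take(1).toUpperCase, soundex(p.lastName), p.gender, p.dob, p.zip).mkString("|")
}
```

A misspelling such as "Simenov" maps to the same Soundex code as "Simeonov", so both records yield the same fuzzy serialization while their full serializations differ. That is the false-positive/false-negative trade-off the slide asks you to tune.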
  20. Build pseudonymous IDs with Spark (and sanitize PII-based quasi-identifiers)
  21. We need a few user-defined functions
      • Strong secure hash function with very few collisions
        – sha256(data) computes SHA-256
      • Strong symmetric key encryption
        – aes_encrypt(data, secret) in Hive but not ported to Spark
        – aes__encrypt(data, secret) is a UDF to avoid name conflict
      • Demo sugar to build secrets from pass phrases
        – secret(pass_phrase)
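The deck does not show how these helpers are implemented. The following plain-Scala sketch shows one way the three operations could look using only JDK classes; wrapping them as Spark UDFs is left out. Everything here is an assumption: the function names, deriving the key by hashing the pass phrase (demo sugar only, a real system would use a proper key-derivation function and key management), and the deterministic `AES/ECB/PKCS5Padding` mode, chosen so that equal inputs produce equal IDs.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object PseudoIdCrypto {
  /** SHA-256 of the UTF-8 bytes of `data`, as a lowercase hex string. */
  def sha256Hex(data: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(data.getBytes(UTF_8))
      .map(b => "%02x".format(b))
      .mkString

  /** Demo sugar: a 128-bit AES key taken from the hash of a pass phrase. */
  def secret(passPhrase: String): SecretKeySpec = {
    val digest = MessageDigest.getInstance("SHA-256").digest(passPhrase.getBytes(UTF_8))
    new SecretKeySpec(digest.take(16), "AES")
  }

  /** Deterministic AES encryption: same data and key give the same bytes. */
  def aesEncrypt(data: Array[Byte], key: SecretKeySpec): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, key)
    cipher.doFinal(data)
  }
}
```

Determinism is what lets two data sets be joined on the resulting IDs. It also means that equal plaintexts are visible as equal ciphertexts, which is acceptable here only because the input is already a hash rather than raw PII.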
  22. Let’s create some PII
      import spark.implicits._ // encoders for createDataset and the 'column syntax
      case class PII(firstName: String, lastName: String, gender: String, dob: String, zip: String)
      val sim = PII("Sim", "Simeonov", "M", "1977-07-07", "02140")
      val ids = spark.createDataset(Seq(sim))
  23. Consistent serialization
      import org.apache.spark.sql.functions._ // concat, lit, upper, soundex
      val p = lit("|") // just a pipe symbol to save us typing
      lazy val idRules = Seq(
        // Rule 1: Use all PII
        concat(upper('firstName), p, upper('lastName), p, 'gender, p, 'dob, p, 'zip),
        // Rule 2: Use only first initial of first name and soundex of last name
        concat(upper('firstName.substr(1, 1)), p, soundex(upper('lastName)), p, 'gender, p, 'dob, p, 'zip)
      )
  24. Hash & encrypt
      // The pseudonymous ID columns built from the rules
      lazy val psids = {
        val masterPassword = "Master Password" // master password to encrypt IDs with
        // Serialize -> Hash -> Encrypt
        idRules.zipWithIndex.map { case (serialization, idx) =>
          aes__encrypt(sha256(serialization), secret(lit(masterPassword)))
            .as(s"psid${idx + 1}")
        }
      }
  25. PII-based quasi-identifiers
      import org.apache.spark.sql.Column
      import org.apache.spark.sql.types.IntegerType
      // Generalization of quasi-identifying columns
      lazy val quasiIdCols: Seq[Column] = Seq(
        'gender,
        'dob.substr(1, 4).cast(IntegerType).as("yob"), // only year of birth
        'zip.substr(1, 3).cast(IntegerType).as("zip3") // only first 3 digits of zip
      )
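The generalization above can be sketched without Spark as a plain function over one record. The names here are hypothetical. One deliberate difference from the slide: the slide casts `zip3` to an integer, which turns a prefix such as 021 into 21, while this sketch keeps `zip3` as a string so leading zeros survive. Whether that matters depends on how the column is consumed downstream.

```scala
object QuasiIdGeneralization {
  final case class Generalized(gender: String, yob: Int, zip3: String)

  /**
   * Keeps gender as-is, reduces an ISO date of birth (yyyy-MM-dd) to the year,
   * and reduces a zip code to its first three digits.
   */
  def generalize(gender: String, dob: String, zip: String): Generalized =
    Generalized(gender, dob.take(4).toInt, zip.take(3))
}
```

For the sample record in the deck, `generalize("M", "1977-07-07", "02140")` gives `Generalized("M", 1977, "021")`.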
  26. Generate master IDs
      // Master pseudonymous IDs
      lazy val masterIds = ids.select(quasiIdCols ++ psids: _*)
  27. Generate per partner IDs
      val partnerPasswords = Map("A" -> "A Password", "B" -> "B Password")
      val partnerIds = spark.createDataset(partnerPasswords.toSeq)
        .toDF("partner_name", "pwd").withColumn("pwd", secret('pwd))
        .crossJoin(masterIds)
        .transform { df =>
          psids.indices.foldLeft(df) { case (current, idx) =>
            val colName = s"psid${idx + 1}"
            current.withColumn(colName, base64(aes__encrypt(col(colName), 'pwd)))
          }
        }
        .drop("pwd")
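The layering above (hash, then master encryption, then a per-partner encryption) has two useful properties: partners receive different IDs for the same person, and whoever holds a partner's password can strip the partner layer to get back the master ID. The plain-Scala sketch below demonstrates both outside Spark. All names are hypothetical, and the key derivation and `AES/ECB/PKCS5Padding` mode are assumptions, not the deck's implementation. It also hashes to raw bytes, where the deck's `sha256` may produce a different encoding.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object PartnerIds {
  // Demo-only key derivation: first 16 bytes of SHA-256(pass phrase).
  private def key(passPhrase: String): SecretKeySpec =
    new SecretKeySpec(
      MessageDigest.getInstance("SHA-256").digest(passPhrase.getBytes(UTF_8)).take(16),
      "AES")

  private def aes(mode: Int, data: Array[Byte], passPhrase: String): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding") // deterministic by design
    cipher.init(mode, key(passPhrase))
    cipher.doFinal(data)
  }

  /** Serialize -> hash -> encrypt with the master password. */
  def masterId(serialized: String, masterPwd: String): Array[Byte] = {
    val hash = MessageDigest.getInstance("SHA-256").digest(serialized.getBytes(UTF_8))
    aes(Cipher.ENCRYPT_MODE, hash, masterPwd)
  }

  /** Wrap a master ID in a partner-specific layer and Base64-encode it. */
  def partnerId(master: Array[Byte], partnerPwd: String): String =
    Base64.getEncoder.encodeToString(aes(Cipher.ENCRYPT_MODE, master, partnerPwd))

  /** Strip the partner layer, recovering the master ID. */
  def toMasterId(partnerId: String, partnerPwd: String): Array[Byte] =
    aes(Cipher.DECRYPT_MODE, Base64.getDecoder.decode(partnerId), partnerPwd)
}
```

Because the IDs differ per partner, two partners cannot join their data on them directly; the join has to go through the party that holds the passwords.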
  28. The end result
  29. Sanitizing quasi-identifiers in Spark
      • Optimal k-anonymity is an NP-hard problem
        – Mondrian algorithm: greedy O(n log n) approximation
          • https://github.com/eubr-bigsea/k-anonymity-mondrian
      • Active research
        – Locality-sensitive hashing (LSH) improvements
        – Risk-based approaches (e.g., LBS algorithm)
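To give a feel for the greedy approach mentioned above, here is a simplified plain-Scala sketch in the spirit of Mondrian's strict partitioning for numeric quasi-identifiers. It is not the linked library's code and it omits the final generalization step. It recursively splits the rows at the median of the widest-ranging dimension and stops when no split leaves at least k rows on both sides. Rows are assumed to be equal-length numeric vectors; `MondrianSketch` and `partition` are hypothetical names.

```scala
object MondrianSketch {
  type Row = Vector[Double]

  /** Splits `rows` into groups of at least k rows each (given rows.size >= k). */
  def partition(rows: Vector[Row], k: Int): Vector[Vector[Row]] = {
    require(k >= 1, "k must be positive")
    if (rows.size < 2 * k) Vector(rows) // too small to split into two valid halves
    else {
      def range(d: Int): Double = rows.map(_(d)).max - rows.map(_(d)).min
      // Try dimensions from widest to narrowest range.
      val dims = rows.head.indices.sortBy(d => -range(d))
      val split = dims.iterator.map { d =>
        val sorted = rows.sortBy(_(d))
        val median = sorted(sorted.size / 2)(d)
        sorted.partition(_(d) < median) // strict split: equal values stay together
      }.find { case (left, right) => left.size >= k && right.size >= k }
      split match {
        case Some((left, right)) => partition(left, k) ++ partition(right, k)
        case None                => Vector(rows) // no allowable cut on any dimension
      }
    }
  }
}
```

Each resulting group would then be published with its quasi-identifiers replaced by the group's value ranges, which is where the information loss measured by a metric such as NCP comes from.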
  30. Interested in challenging data engineering, ML & AI on petabytes of data? I’d love to hear from you.
      @simeons / sim at swoop dot com
      • https://databricks.com/session/great-models-with-great-privacy-optimizing-ml-ai-under-gdpr
      • https://databricks.com/session/the-smart-data-warehouse-goal-based-data-production
      • https://swoop-inc.github.io/spark-records/
      Privacy matters. Thank you for caring.
