There is a growing feeling that privacy concerns dampen innovation in machine learning and AI applied to personal and/or sensitive data. After all, ML and AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously degrade model quality and predictive power. While this is technically true for some privacy-safe modeling techniques, it is not true in general. The root cause of the problem is twofold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-preserving machine learning & AI easy. This talk will challenge the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers will introduce a wide range of techniques, from simple hashing to advanced embeddings, for high-accuracy, privacy-safe model development. Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization, and ensuring downstream privacy in multi-party collaborations. In addition, we will dig into embeddings as a unique deep-learning-based approach for privacy-preserving modeling over unstructured data. Special attention will be given to Spark-based production environments.
Speakers: Sim Simeonov, Slater Victoroff
Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data
1. Great Models with Great Privacy
Optimizing ML & AI Over Sensitive Data
Sim Simeonov, CTO, Swoop
sim@swoop.com / @simeons
Slater Victoroff, CTO, Indico
slater@indico.io / @sl8rv
#UnifiedAnalytics #SparkAISummit
2. privacy-preserving ML/AI to improve patient
outcomes and pharmaceutical operations
e.g., we improve the diagnosis rate of rare diseases
3. Intelligent Process Automation for Unstructured Content
using ML/AI with strong data protection guarantees
e.g. we automate the processing of loan documents using
a combination of NLP and Computer Vision
4. Regulation affects ML & AI
• General Data Protection Regulation (GDPR)
– Already in effect in the EU
• California Consumer Privacy Act (CCPA)
– Comes into effect Jan 1, 2020
• Many federal efforts under way
– Information Transparency and Data Control Act (DelBene, D-WA)
– Consumer Data Protection Act (Wyden, D-OR)
– Data Care Act (Schatz, D-HI)
5. accuracy vs. privacy is a false dichotomy
(if you are willing to invest in privacy infrastructure)
8. Session roadmap
• The rest of this session
– Sanitizing a single dataset using Spark
• After the break
– Sanitizing joinable datasets the smart way
– Embeddings for unstructured data (deep learning!)
11. Direct identifiability
• Personal information (PI)
• Sanitize with pseudonymous identifiers
– Secure one-way mapping of PI to opaque IDs
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
12. Secure pseudonymous ID generation
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Simeon Simeonov; M; 1977-07-07
One Swoop Way, Suite 305, Cambridge, MA 02140
...
Sim|Simeonov|M|1977-07-07|02140 // canonical representation
8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx)
gPGIoVw … nNpij1LveZRtKeWU= // master encryption (AES-xxx)
Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption
6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption
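The pipeline on this slide (canonicalize, then destructively hash, then derive per-partner IDs) can be sketched in Python. Field names are illustrative, and a keyed HMAC stands in for the slide's AES master/partner encryption step:

```python
import hashlib
import hmac

def canonicalize(first: str, last: str, sex: str, dob: str, zip5: str) -> str:
    """Normalize PI into the pipe-delimited canonical form from the slide."""
    return "|".join([first.strip().title(), last.strip().title(),
                     sex.strip().upper()[:1], dob, zip5])

def master_id(canonical: str) -> str:
    """Secure destructive hashing: one-way SHA-256 of the canonical record."""
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def partner_id(canonical: str, partner_key: bytes) -> str:
    """Partner-specific opaque ID. A keyed HMAC stands in for the slide's
    per-partner encryption; each partner key yields unlinkable IDs."""
    return hmac.new(partner_key, canonical.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Because the mapping is keyed per partner, IDs shared with partner A cannot be joined against IDs shared with partner B without access to the keys.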
13. Dealing with dirty data
Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean
S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos
Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …)
tune fuzzification to use cases & desired FP/FN rates
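The name fuzzification shown above matches classic Soundex (S551 is the Soundex code for "Simeonov"). A sketch combining initial-plus-Soundex names with year-month date generalization, assuming the pipe-delimited layout from the previous slide:

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits coding consonant
    classes; adjacent same-class consonants collapse, h/w are transparent."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
        if len(out) == 4:
            break
    return (out + "000")[:4]

def fuzzify(first: str, last: str, sex: str, dob: str, zip5: str) -> str:
    """Initial + Soundex for names; generalize date of birth to year-month."""
    return "|".join([first[0].upper(), soundex(last), sex, dob[:7], zip5])
```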
17. Sanitizing quasi-identifiers in Spark
• Optimal k-anonymity is an NP-hard problem
– Mondrian algorithm: greedy O(nlogn) approximation
https://github.com/eubr-bigsea/k-anonymity-mondrian
• Active research
– Locality-sensitive hashing (LSH) improvements
– Risk-based approaches (LBS algorithm)
– Academic Spark implementations (TDS, MDSBA)
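The greedy approximation can be illustrated in a few lines. This is a toy, in-memory Python sketch of a Mondrian-style partitioner over numeric quasi-identifiers, not the linked distributed implementation:

```python
def mondrian(rows, qi_cols, k):
    """Greedy Mondrian-style k-anonymization: recursively median-split on the
    quasi-identifier with the widest range while both halves keep >= k rows.
    Returns a list of equivalence classes (lists of row dicts)."""
    def span(part, col):
        vals = [r[col] for r in part]
        return max(vals) - min(vals)

    def split(part):
        col = max(qi_cols, key=lambda c: span(part, c))
        if span(part, col) == 0 or len(part) < 2 * k:
            return [part]  # cannot cut further; emit one equivalence class
        ordered = sorted(part, key=lambda r: r[col])
        mid = len(ordered) // 2
        return split(ordered[:mid]) + split(ordered[mid:])

    return split(rows)
```

Each returned class would then be generalized (e.g., replace each quasi-identifier with its min-max range) before release.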
18. Sanitizing joinable datasets
• The industry standard: centralized sanitization
• Sanitize the data as if it were joined together
• Big increase in quasi-identifiers
19. The curse of dimensionality for sanitization
"We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity."
On k-Anonymity and the Curse of Dimensionality
Aggarwal, C., IBM T. J. Watson Research Center, 2005
20. Curse of dimensionality example
• Titanic passenger data
• k-anonymize for different values of k by
– Age
– Age & gender
• Compute normalized certainty penalty (NCP)
– Higher values mean more information loss
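For a single numeric attribute, NCP can be computed as the average width of each record's equivalence class relative to the attribute's global range. A minimal sketch under that simplifying assumption (toy data, one numeric attribute):

```python
def ncp(classes, global_min, global_max):
    """Normalized certainty penalty for one numeric attribute: every record
    pays (range of its equivalence class) / (global range); report the mean.
    0 means no information loss; 1 means the attribute is fully suppressed."""
    total_range = global_max - global_min
    penalty = sum(len(c) * (max(c) - min(c)) / total_range for c in classes)
    n = sum(len(c) for c in classes)
    return penalty / n
```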
23. Centralized sanitization increases risk
"We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality."
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing
Fredrikson, M. et al., UW-Madison and Marshfield Clinic Research Foundation, 2014
25. Federated sanitization (Swoop’s prAIvacy™)
• Isolated execution environments
• Workflows execute across EEs
• Task-specific sanitization firewalls
• Sanitization is often lossless
e.g., model condition X without mixing health data with other data
26. There is no standard anonymization framework
for unstructured data: suppress or structure or ???
27. The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
28. The Problem With Text
Problem: PII
Solution(s)
• Remove common names?
• Tell doctors to stop using names in their notes?
• Look up patient information in the notes and intentionally remove it?
30. The Problem With Text
Problem: Quasi-Identifiers
Solution(s)
• Remove all locations?
• Remove all gendered pronouns?
• Remove all numbers?
32. The Problem With Text
Problem: Weak Identifiers
Solution(s)
• Remove all text
34. The Real Problem With Text
Problem: Predictive/Clinical Value (accuracy matters)
Solution(s)
• Embeddings
37. What’s Changed?
• i2b2 dataset for de-identification (PII & quasi-identifiers)
– Standard NLP (part-of-speech tagging, rules-based systems, keywords)
• 0.91 F1 after months of writing rules [1, 2]
• Drops to 0.79 F1 on new data (unusable) [1, 2]
– Modern NLP (deep learning, embeddings)
• 0.98 F1 without any rule-writing (roughly a 10x error reduction) [1, 3]
• >0.99 F1 on new data [1, 3]
[1]: Max Friedrich. Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text. November 20, 2018. https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf
[2]: Amber Stubbs, Michele Filannino, and Özlem Uzuner. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics, 75(S):S4–S18, November 2017. ISSN 1532-0464. https://doi.org/10.1016/j.jbi.2017.06.011
[3]: Franck Dernoncourt, Ji Young Lee, Özlem Uzuner, and Peter Szolovits. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association, 24(3):596–606, 2017. https://doi.org/10.1093/jamia/ocw156
38. What is an Embedding?
[Diagram: Text Space (e.g., English) → Embedding Method (e.g., Word2Vec) → Embedding Space (e.g., R^300); a piece of text maps to a dense numeric vector such as [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]]
41. What is an Embedding?
• Text Space
– Full Algorithmic Utility
– Limited to No Guarantees on Privacy
– Sparse, Inherently Brittle
• Embedding Space
– Very High Algorithmic Utility
– Strong Privacy Guarantees
– Dense, Generically More Generalizable
42. Why are embeddings helpful?
“This representation would still require humans to annotate PHI (as this
is the training data for the task) but the pseudonymization step would be
performed by the transformation to the representation. A tool trained on
the representation could easily be made publicly available because its
parameters cannot contain any protected data, as it is never trained on
raw text.”
- Max Friedrich (Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text), November 20, 2018
https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf
43. How do Embeddings Work?
[Diagram: the classic analogy King - man + woman ≈ Queen, plotted in the "royalty" region of the embedding space]
• Meaning is "encoded" into the embedding space
• Individual dimensions are not human-interpretable
• The embedding method learns by examining large corpora of generic language
• The goal is accurate language representation as a proxy for downstream performance
46. “Word” Embeddings
[Lookup table: token → vector, e.g. “great” → [0.1, 0.3, …]]
Training
• CBOW: The quick brown fox _____ over the lazy dog (predict the center word from its context)
• Skip-gram: ___ ___ ____ ___ jumps ___ __ ___ ___ (predict the context from the center word)
Examples in practice
• Word2vec
• GloVe
• fastText
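Generating skip-gram training pairs from a sentence is a one-function sketch (window size and whitespace tokenization are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: (center word, context word) for every
    context position within the window around each center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

CBOW inverts the direction: it aggregates the context window to predict the center word.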
47. Do They Really Preserve Algorithmic Value?
[Chart: word2vec benchmark (IMDB sentiment analysis) — accuracy from ~0.5 to ~0.8 vs. number of data points (50–300), comparing tf-idf and word2vec]
• Embeddings generally outperform raw text at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is 6-year-old tech. Embeddings have improved drastically. Text has not.
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which logistic regression hyperparameters are optimized. Generated using Enso.
48. Do They Really Preserve Algorithmic Value?
[Chart: GloVe benchmark (Rotten Tomatoes sentiment analysis) — accuracy from ~0.5 to ~0.9 vs. number of data points (50–500), comparing tf-idf and GloVe]
• Embeddings generally outperform raw text at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is 4-year-old tech. Embeddings have improved drastically. Text has not.
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which logistic regression hyperparameters are optimized. Generated using Enso.
54. Additional Notes
• “Word” Embeddings
– May not correspond directly to words
– Many excellent public implementations
– Vulnerable to rainbow-table attacks (Do not store)
• Text Embeddings
– Often built on top of word embeddings
– K-anonymity can be directly assessed
– Naively implemented as mean of word embeddings
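The naive mean-of-word-embeddings text embedding mentioned above can be sketched as follows (toy vectors; out-of-vocabulary tokens are skipped):

```python
def text_embedding(tokens, vectors):
    """Naive text embedding: element-wise mean of the word vectors of the
    tokens found in the lookup table; unknown tokens are ignored."""
    dims = len(next(iter(vectors.values())))
    acc, hits = [0.0] * dims, 0
    for t in tokens:
        v = vectors.get(t)
        if v is not None:
            hits += 1
            acc = [a + b for a, b in zip(acc, v)]
    return [a / hits for a in acc] if hits else acc
```

Because the mean mixes many tokens into one dense vector, individual words generally cannot be read back out, which is also why k-anonymity can be assessed directly on the resulting vectors.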
55. Embedding Options
• Public
– Ready-to-use downloadable word vectors
– Trained on public datasets (e.g. word2vec trained on CommonCrawl)
• Homegrown
– Embedding technique trained in-house
– Typically a straightforward algorithm on a novel data corpus
• Third Party
– Sourced from vendor responsible for updating data sources and
algorithms
56. Embedding Options (Public)
[Ratings chart (Low/Medium/High) across: Ease of Use, Algorithmic Advantage, Intrinsic Privacy, Efficacy, Transparency & Control]
Notes
• Public embeddings effectively mean public rainbow tables
• Licensing is often ambiguous or incompatible with commercial use
57. Embedding Options (Homegrown)
[Ratings chart (Low/Medium/High) across: Ease of Use, Algorithmic Advantage, Intrinsic Privacy, Efficacy, Transparency & Control]
Notes
• Deceptively easy to start, extremely difficult to use effectively in production
• While this can theoretically be a good approach, for most organizations it is a bad idea
58. Embedding Options (Third Party)
[Ratings chart (Low/Medium/High) across: Ease of Use, Algorithmic Advantage, Intrinsic Privacy, Efficacy, Transparency & Control]
Notes
• Data leakage can result in handing away your competitive edge
• Efficacy is highly variable; ensure robust benchmarks before working with any third party
• A privacy-controlled environment makes involving a third party contractually difficult
59. State of Embeddings in Spark
• Training yourself
– Reasonable word2vec implementations
– Proper evaluation requires two nested CV loops + a private
holdout at a minimum
• Basic NLP Functionality
– John Snow NLP Library (https://nlp.johnsnowlabs.com/)
• Pretrained Embeddings
– Compute offline in Python (Gensim, spaCy, indico)
60. Summary
• Privacy-preserving computation is important
• De-identification is easy with Spark
• Beware centralized sanitization
• Embeddings are great for privacy
61. Interested in challenging data engineering, ML & AI on very big data?
We’d love to hear from you. sim@swoop.com / slater@indico.io
Privacy matters. Thank you for caring.