SlideShare a Scribd company logo
1 of 61
Download to read offline
Great Models with Great Privacy
Optimizing ML & AI Over Sensitive Data
Sim Simeonov, CTO, Swoop
sim@swoop.com / @simeons
Slater Victoroff, CTO, Indico
slater@indico.io / @sl8rv
#UnifiedAnalytics #SparkAISummit
privacy-preserving ML/AI to improve patient
outcomes and pharmaceutical operations
e.g., we improve the diagnosis rate of rare diseases
Intelligent Process Automation for Unstructured Content
using ML/AI with strong data protection guarantees
e.g. we automate the processing of loan documents using
a combination of NLP and Computer Vision
Regulation affects ML & AI
• General Data Protection Regulation (GDPR)
– Already in effect in the EU
• California Consumer Protection Act (CCPA)
– Comes into effect Jan 1, 2020
• Many federal efforts under way
– Information Transparency and Data Control Act (DelBene, D-WA)
– Consumer Data Protection Act (Wyden, D-OR)
– Data Care Act (Schatz, D-HI)
accuracy vs. privacy is a false dichotomy
(if you are willing to invest in privacy infrastructure)
Privacy-preserving computation frontiers
• Stochastic
– Differential privacy (DP)
• Encryption-based
– Fully homomorphic encryption (FHE)
• Protocol-based
– Secure multi-party computation (SMC)
When privacy-preserving algorithms are immature,
sanitize the data the algorithms are trained on
Session roadmap
• The rest of this session
– Sanitizing a single dataset using Spark
• After the break
– Sanitizing joinable datasets the smart way
– Embeddings for unstructured data (deep learning!)
What does it mean to sanitize data for privacy?
Identifiability spectrum
Personally-Identified Information (PII)
Device-Identified Information (DII)De-Identified Information
Direct identifiability
• Personal information (PI)
• Sanitize with pseudonymous identifiers
– Secure one-way mapping of PI to opaque IDs
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Secure pseudonymous ID generation
Sim|Simeonov|M|1977-07-07|02140
8daed4fa67a07d7a5 … 6f574021
gPGIoVw … nNpij1LveZRtKeWU=
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
// canonical representation
// secure destructive hashing (SHA-xxx)
// master encryption (AES-xxx)
Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption
6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption
Simeon Simeonov; M; 1977-07-07
One Swoop Way, Suite 305, Cambridge, MA 02140
...
Dealing with dirty data
Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean
S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos
Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …)
tune fuzzification to use cases & desired FP/FN rates
Building pseudonymous IDs with Spark
Indirect identifiability via quasi-identifiers
Sanitizing quasi-identifiers
• k-anonymity
– Generalize or suppress quasi-identifiers
– Any record is “similar” to at least k-1 other records
• (k, ℇ)-anonymity
– adds noise
Sanitizing quasi-identifiers in Spark
• Optimal k-anonymity is an NP-hard problem
– Mondrian algorithm: greedy O(nlogn) approximation
https://github.com/eubr-bigsea/k-anonymity-mondrian
• Active research
– Locale-sensitive hashing (LSH) improvements
– Risk-based approaches (LBS algorithm)
– Academic Spark implementations (TDS, MDSBA)
Sanitizing joinable datasets
• The industry standard: centralized sanitization
• Sanitize the data as if it were joined together
• Big increase in quasi-identifiers
We show that when the data contains a large number of attributes which may be
considered quasi-identifiers, it becomes difficult to anonymize the data without an
unacceptably high amount of information loss. ... we are faced with ... either
completely suppressing most of the data or losing the desired level of anonymity.
On k-Anonymity and the Curse of Dimensionality
2005 Aggarwal, C. @ IBM T. J. Watson Research Center
The curse of dimensionality for sanitization
Curse of dimensionality example
• Titanic passenger data
• k-anonymize for different values of k by
– Age
– Age & gender
• Compute normalized certainty penalty (NCP)
– Higher values mean more information loss
Normalized Certainty Penalty
0%
5%
10%
15%
20%
25%
30%
35%
40%
2 3 4 5 6 7 8 9 10
k age gender & age
k-anonymizing Titanic passenger survivability
300%
Centralized sanitization describes an alternate reality
We find that for privacy budgets effective at preventing attacks,
patients would be exposed to increased risk of stroke,
bleeding events, and mortality.
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing
2014 Fredrikson, M. et. al. @ UW Madison and Marshfield Clinic Research Foundation
Centralized sanitization increases risk
High-accuracy ML/AI requires federated sanitization
Federated sanitization (Swoop’s prAIvacy™)
• Isolated execution environments
• Workflows execute across EEs
• Task-specific sanitization firewalls
• Sanitization is often lossless
Model condition X
without mixing health
data with other data
There is no standard anonymization framework
for unstructured data: suppress or structure or ???
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
PII
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
PII
Solution(s)
• Remove common names?
• Tell Doctors to stop using
names in their notes?
• Lookup patient information
in notes and intentional
remove
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Quasi-Identifiers
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Quasi-Identifiers
Solution(s)
• Remove all locations?
• Remove all gendered
pronouns?
• Remove all numbers?
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Weak Identifiers
The Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Weak Identifiers
Solution(s)
• Remove all text
The Real Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Predictive/Clinical Value
The Real Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Predictive/Clinical Value
Accuracy Matters
The Real Problem With Text
John Malkovitch plays tennis in Winchester.
He has been reporting soreness in his
elbow. His 60th birthday is in two weeks.
After he returns from his birthday trip to
Casablanca we will recommend a steroid
shot to reduce inflammation.
Problem
Predictive/Clinical Value
Accuracy Matters
Solution(s)
• Embeddings
What’s Changed?
• I2b2 dataset for de-identification (PII & Quasi-identifiers)
– Standard NLP (Part of speech tagging, rules-based, keywords)
• 0.91 F1 after months of writing rules [1, 2]
• Drops to 0.79 F1 on new data (Unusable) [1, 2]
– Modern NLP (Deep learning, embeddings)
• 0.98 F1 without any rule-writing. (Roughly a 10x error reduction) [1,3]
• >0.99 F1 on new data [1,3]
[1]: Max Friedrich. Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text November 20, 2018 https://www.inf.uni-
hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf
[2]: Amber Stubbs, Michele Filannino, and Ozlem Uzuner. De-identification of psychiatric ¨ intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of
Biomedical Informatics, 75(S):S4–S18, November 2017. ISSN 1532-0464. URL https://doi.org/10.1016/j.jbi.2017.06.011.
[3]: Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. De-identification of patient notes with recurrent neural networks. Journal of the American
Medical References Informatics Association, 24(3):596–606, 2017b. URL https://doi.org/10.1093/ jamia/ocw156.
What is an Embedding?
Text Space
(e.g. English)
Embedding Space
(e.g. R300)
Embedding Method
(e.g. Word2Vec)
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
What is an Embedding?
Text Space
(e.g. English)
Embedding Space
(e.g. R300)
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
Embedding Method
(e.g. Word2Vec)
Words
Manifold
What is an Embedding?
• Text Space
– Full Algorithmic Utility
– Limited to No Guarantees on Privacy
– Sparse, Inherently Brittle
• Embedding Space
– Very High Algorithmic Utility
– Strong Privacy Guarantees
– Dense, Generically More Generalizable
Why are embeddings helpful?
“This representation would still require humans to annotate PHI (as this
is the training data for the task) but the pseudonymization step would be
performed by the transformation to the representation. A tool trained on
the representation could easily be made publicly available because its
parameters cannot contain any protected data, as it is never trained on
raw text.”
- Max Friedrich (Data Protection Compliant
Exchange of Training Data for Automatic De-Identification
of Medical Text) November 20, 2018
https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-
theses/2018-ma-friedrich.pdf
King
Queen
- man
+ woman
(Royalty)
How do Embeddings Work?
• Meaning is “encoded” into
the embedding space
• Individual dimensions are not
human interpretable
• Embedding method learns by
examining large corpora of
generic language
• Goal is accurate language
representation as a proxy for
downstream performance
“Word” Embeddings
Examples
• Word2vec
• GloVe
• fastText
“Word” Embeddings
Token Value
“great” [0.1, 0.3,
…]
… …
Examples In Practice
• Word2vec
• GloVe
• fastText
“Word” Embeddings
Token Value
“great” [0.1, 0.3,
…]
… …
Examples In Practice
Training
The quick brown fox _____ over the lazy dog
___ ___ ____ ___ jumps ___ __ ___ ___
CBOW
Skip Gram
• Word2vec
• GloVe
• fastText
Do They Really Preserve Algorithmic Value?
0.5
0.55
0.6
0.65
0.7
0.75
0.8
50 75 100 125 150 175 200 225 250 275 300
Word2vec Benchmark (IMDB Sentiment Analysis)
tf-idf
word2vec
• Embeddings generally
outperform raw text at low
data volumes
• Leveraging large, generic
text corpora improves
generalizability
• This is 6 year old tech.
Embeddings have improved
drastically. Text has not.
Reported numbers are the average of 5 runs of randomly sampled test/train
splits each reporting the average of a 5-fold cv, within which Logistic
Regression hyperparameters are optimized. Generated using Enso
Do They Really Preserve Algorithmic Value?
• Embeddings generally
outperform raw text at low data
volumes
• Leveraging large, generic text
corpora improves
generalizability
• This is 4 year old tech.
Embeddings have improved
drastically. Text has not.
Reported numbers are the average of 5 runs of randomly sampled test/train splits
each reporting the average of a 5-fold cv, within which Logistic Regression
hyperparameters are optimized. Generated using Enso
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
50
75
100
125
150
175
200
225
250
275
300
325
350
375
400
425
450
475
500
Accuracy
Number of Data Points
Glove Benchmark (Rotten Tomatoes
Sentiment Analysis)
tf-idf
Glove
Text Embeddings
Examples
• Doc2vec
• Elmo
• ULMFiT
Text Embeddings
Examples
In Practice
Often built on top of pre-trained word embeddings
• Doc2vec
• Elmo
• ULMFiT
Text Embeddings
Examples In Practice
Training
The quick brown fox jumps over the lazy
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
Language
Supervised
dog
True
Often built on top of pre-trained word embeddings
• Doc2vec
• Elmo
• ULMFiT
Text Embeddings
CNN-Style
The quick brown fox jumps over the lazy
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
Prediction
https://arxiv.org/pdf/1408.5882.pdf
Example
Text Embeddings
RNN-Style
The quick brown fox jumps over the lazy
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
Output
Memory
0.1
0.2
0.8
0.1
0.3
0.6
0.8
0.3
…
0.5
0.3
0.4
0.2
0.1
0.1
0.9
0.3
…
0.3
0.5
0.6
0.2
0.4
0.4
0.9
0.9
…
0.1
0.1
0.2
0.3
0.9
0.5
0.1
0.5
…
0.8
0.3
0.7
0.9
0.7
0.9
0.3
0.5
…
0.9
0.5
0.5
0.6
0.5
0.2
0.8
0.1
…
0.9
0.6
0.1
0.7
0.4
0.4
0.8
0.1
…
0.6
0.5
0.8
0.6
0.9
0.6
0.3
0.1
…
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
…
σ σ σ σ σ σ σ σ
Prediction
https://arxiv.org/pdf/1802.05365.pdf
Example
Additional Notes
• “Word” Embeddings
– May not correspond directly to words
– Many excellent public implementations
– Vulnerable to rainbow-table attacks (Do not store)
• Text Embeddings
– Often built on top of word embeddings
– K-anonymity can be directly assessed
– Naively implemented as mean of word embeddings
Embedding Options
• Public
– Ready-to-use downloadable word vectors
– Trained on public datasets (e.g. word2vec trained on CommonCrawl)
• Homegrown
– Embedding technique trained in-house
– Typically a straightforward algorithm on a novel data corpus
• Third Party
– Sourced from vendor responsible for updating data sources and
algorithms
Embedding Options (Public)
Notes
Ease of Use Algorithmic
Advantage
Intrinsic
Privacy
Efficacy Transparency
& Control
Low
Medium
High
• Public embeddings effectively mean
public rainbow tables
• Licensing is often ambiguous or
incompatible with commercial use
Embedding Options (Homegrown)
Notes
Ease of Use Algorithmic
Advantage
Intrinsic
Privacy
Efficacy Transparency
& Control
Low
Medium
High
• Deceptively easy to start, extremely
difficult to effectively use in production
• While this can theoretically be a good
approach, for most organizations this
is a bad idea.
Embedding Options (Third party)
Notes
Ease of Use Algorithmic
Advantage
Intrinsic
Privacy
Efficacy Transparency
& Control
Low
Medium
High
• Data leakage can result in handing
away competitive edge
• Efficacy highly variable. Ensure robust
benchmarks before working with any
third party
• Privacy-controlled environment makes
involving a third party contractually
difficult
State of Embeddings in Spark
• Training yourself
– Reasonable word2vec implementations
– Proper evaluation requires two nested CV loops + a private
holdout at a minimum
• Basic NLP Functionality
– John Snow NLP Library (https://nlp.johnsnowlabs.com/)
• Pretrained Embeddings
– Compute offline in python (Gensim, Spacy, indico)
Summary
• Privacy-preserving computation is important
• De-identification is easy with Spark
• Beware centralized sanitization
• Embeddings are great for privacy
Interested in challenging data engineering, ML & AI on very big data?
We’d love to hear from you. sim@swoop.com / slater@indico.io
Privacy matters. Thank you for caring.

More Related Content

Similar to Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data

Fix What Matters
Fix What MattersFix What Matters
Fix What MattersEd Bellis
 
Why care about application security
Why care about application securityWhy care about application security
Why care about application securityPawel Krawczyk
 
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018MACHINE LEARNING: BEYOND DIAGNOSIS II—2018
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018SMARTMD
 
205250 crystall ball
205250 crystall ball205250 crystall ball
205250 crystall ballp6academy
 
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party Cyb...
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party  Cyb...Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party  Cyb...
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party Cyb...Rishi Singh
 
Data Anonymization For Better Software Testing
Data Anonymization For Better Software TestingData Anonymization For Better Software Testing
Data Anonymization For Better Software TestingCloverDX
 
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)infoShare 2011 - Paweł Krawczyk - Why care about application security (open)
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)Infoshare
 
Dragos and CyberWire: ICS Ransomware
Dragos and CyberWire: ICS Ransomware Dragos and CyberWire: ICS Ransomware
Dragos and CyberWire: ICS Ransomware Dragos, Inc.
 
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)Shift Conference
 
Crush Common Cybersecurity Threats with Privilege Access Management
Crush Common Cybersecurity Threats with Privilege Access ManagementCrush Common Cybersecurity Threats with Privilege Access Management
Crush Common Cybersecurity Threats with Privilege Access ManagementBeyondTrust
 
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....Health-Tech Innovation LABS
 
Optim test data management for IMS 2011
Optim test data management for IMS 2011Optim test data management for IMS 2011
Optim test data management for IMS 2011evgeni77
 
WONDERing about health data
WONDERing about health dataWONDERing about health data
WONDERing about health dataLiveStories
 
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115A Krista Kivisild
 
11 19-2015 - iasaca membership conference - the state of security
11 19-2015 - iasaca membership conference - the state of security11 19-2015 - iasaca membership conference - the state of security
11 19-2015 - iasaca membership conference - the state of securityMatthew Pascucci
 
Accure ai healthcare offering v4
Accure ai healthcare offering v4Accure ai healthcare offering v4
Accure ai healthcare offering v4Accureinc
 

Similar to Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data (20)

Information Security For Small Business
Information Security For Small BusinessInformation Security For Small Business
Information Security For Small Business
 
Fix What Matters
Fix What MattersFix What Matters
Fix What Matters
 
Why care about application security
Why care about application securityWhy care about application security
Why care about application security
 
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018MACHINE LEARNING: BEYOND DIAGNOSIS II—2018
MACHINE LEARNING: BEYOND DIAGNOSIS II—2018
 
205250 crystall ball
205250 crystall ball205250 crystall ball
205250 crystall ball
 
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party Cyb...
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party  Cyb...Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party  Cyb...
Joint Presentation on The State of Cybersecurity ('15-'16) & Third Party Cyb...
 
Data Anonymization For Better Software Testing
Data Anonymization For Better Software TestingData Anonymization For Better Software Testing
Data Anonymization For Better Software Testing
 
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)infoShare 2011 - Paweł Krawczyk - Why care about application security (open)
infoShare 2011 - Paweł Krawczyk - Why care about application security (open)
 
Dragos and CyberWire: ICS Ransomware
Dragos and CyberWire: ICS Ransomware Dragos and CyberWire: ICS Ransomware
Dragos and CyberWire: ICS Ransomware
 
Data Leakage Prevention - K. K. Mookhey
Data Leakage Prevention - K. K. MookheyData Leakage Prevention - K. K. Mookhey
Data Leakage Prevention - K. K. Mookhey
 
Ban Smart Card Mahasweta
Ban Smart Card MahaswetaBan Smart Card Mahasweta
Ban Smart Card Mahasweta
 
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
 
Crush Common Cybersecurity Threats with Privilege Access Management
Crush Common Cybersecurity Threats with Privilege Access ManagementCrush Common Cybersecurity Threats with Privilege Access Management
Crush Common Cybersecurity Threats with Privilege Access Management
 
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....
Angus McCann, IBM. Presentation at Health-Tech Innovation LABS Conference 18....
 
Randy Goebel for the KIEF 2018. FROM DATA TO ECONOMIC VALUE
Randy Goebel for the KIEF 2018. FROM DATA TO ECONOMIC VALUERandy Goebel for the KIEF 2018. FROM DATA TO ECONOMIC VALUE
Randy Goebel for the KIEF 2018. FROM DATA TO ECONOMIC VALUE
 
Optim test data management for IMS 2011
Optim test data management for IMS 2011Optim test data management for IMS 2011
Optim test data management for IMS 2011
 
WONDERing about health data
WONDERing about health dataWONDERing about health data
WONDERing about health data
 
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115
Cybersecurity-for-Industrial-Control-Systems_joa_Eng_0115
 
11 19-2015 - iasaca membership conference - the state of security
11 19-2015 - iasaca membership conference - the state of security11 19-2015 - iasaca membership conference - the state of security
11 19-2015 - iasaca membership conference - the state of security
 
Accure ai healthcare offering v4
Accure ai healthcare offering v4Accure ai healthcare offering v4
Accure ai healthcare offering v4
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 

Recently uploaded (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 

Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data

  • 1. Great Models with Great Privacy Optimizing ML & AI Over Sensitive Data Sim Simeonov, CTO, Swoop sim@swoop.com / @simeons Slater Victoroff, CTO, Indico slater@indico.io / @sl8rv #UnifiedAnalytics #SparkAISummit
  • 2. privacy-preserving ML/AI to improve patient outcomes and pharmaceutical operations e.g., we improve the diagnosis rate of rare diseases
  • 3. Intelligent Process Automation for Unstructured Content using ML/AI with strong data protection guarantees e.g. we automate the processing of loan documents using a combination of NLP and Computer Vision
  • 4. Regulation affects ML & AI • General Data Protection Regulation (GDPR) – Already in effect in the EU • California Consumer Protection Act (CCPA) – Comes into effect Jan 1, 2020 • Many federal efforts under way – Information Transparency and Data Control Act (DelBene, D-WA) – Consumer Data Protection Act (Wyden, D-OR) – Data Care Act (Schatz, D-HI)
  • 5. accuracy vs. privacy is a false dichotomy (if you are willing to invest in privacy infrastructure)
  • 6. Privacy-preserving computation frontiers • Stochastic – Differential privacy (DP) • Encryption-based – Fully homomorphic encryption (FHE) • Protocol-based – Secure multi-party computation (SMC)
  • 7. When privacy-preserving algorithms are immature, sanitize the data the algorithms are trained on
  • 8. Session roadmap • The rest of this session – Sanitizing a single dataset using Spark • After the break – Sanitizing joinable datasets the smart way – Embeddings for unstructured data (deep learning!)
  • 9. What does it mean to sanitize data for privacy?
  • 10. Identifiability spectrum Personally-Identified Information (PII) Device-Identified Information (DII)De-Identified Information
  • 11. Direct identifiability • Personal information (PI) • Sanitize with pseudonymous identifiers – Secure one-way mapping of PI to opaque IDs Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140
  • 12. Secure pseudonymous ID generation Sim|Simeonov|M|1977-07-07|02140 8daed4fa67a07d7a5 … 6f574021 gPGIoVw … nNpij1LveZRtKeWU= Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140 // canonical representation // secure destructive hashing (SHA-xxx) // master encryption (AES-xxx) Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption 6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption Simeon Simeonov; M; 1977-07-07 One Swoop Way, Suite 305, Cambridge, MA 02140 ...
  • 13. Dealing with dirty data Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …) tune fuzzification to use cases & desired FP/FN rates
  • 15. Indirect identifiability via quasi-identifiers
  • 16. Sanitizing quasi-identifiers • k-anonymity – Generalize or suppress quasi-identifiers – Any record is “similar” to at least k-1 other records • (k, ℇ)-anonymity – adds noise
  • 17. Sanitizing quasi-identifiers in Spark • Optimal k-anonymity is an NP-hard problem – Mondrian algorithm: greedy O(nlogn) approximation https://github.com/eubr-bigsea/k-anonymity-mondrian • Active research – Locale-sensitive hashing (LSH) improvements – Risk-based approaches (LBS algorithm) – Academic Spark implementations (TDS, MDSBA)
  • 18. Sanitizing joinable datasets • The industry standard: centralized sanitization • Sanitize the data as if it were joined together • Big increase in quasi-identifiers
  • 19. We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity. On k-Anonymity and the Curse of Dimensionality 2005 Aggarwal, C. @ IBM T. J. Watson Research Center The curse of dimensionality for sanitization
  • 20. Curse of dimensionality example • Titanic passenger data • k-anonymize for different values of k by – Age – Age & gender • Compute normalized certainty penalty (NCP) – Higher values mean more information loss
  • 21. Normalized Certainty Penalty 0% 5% 10% 15% 20% 25% 30% 35% 40% 2 3 4 5 6 7 8 9 10 k age gender & age k-anonymizing Titanic passenger survivability 300%
  • 22. Centralized sanitization describes an alternate reality
  • 23. We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing 2014 Fredrikson, M. et. al. @ UW Madison and Marshfield Clinic Research Foundation Centralized sanitization increases risk
  • 24. High-accuracy ML/AI requires federated sanitization
  • 25. Federated sanitization (Swoop’s prAIvacy™) • Isolated execution environments • Workflows execute across EEs • Task-specific sanitization firewalls • Sanitization is often lossless Model condition X without mixing health data with other data
  • 26. There is no standard anonymization framework for unstructured data: suppress or structure or ???
  • 27. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.
  • 28. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem PII
  • 29. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem PII Solution(s) • Remove common names? • Tell Doctors to stop using names in their notes? • Lookup patient information in notes and intentional remove
  • 30. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Quasi-Identifiers
  • 31. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Quasi-Identifiers Solution(s) • Remove all locations? • Remove all gendered pronouns? • Remove all numbers?
  • 32. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Weak Identifiers
  • 33. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Weak Identifiers Solution(s) • Remove all text
  • 34. The Real Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Predictive/Clinical Value
  • 35. The Real Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Predictive/Clinical Value Accuracy Matters
  • 36. The Real Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem Predictive/Clinical Value Accuracy Matters Solution(s) • Embeddings
  • 37. What’s Changed? • I2b2 dataset for de-identification (PII & Quasi-identifiers) – Standard NLP (Part of speech tagging, rules-based, keywords) • 0.91 F1 after months of writing rules [1, 2] • Drops to 0.79 F1 on new data (Unusable) [1, 2] – Modern NLP (Deep learning, embeddings) • 0.98 F1 without any rule-writing. (Roughly a 10x error reduction) [1,3] • >0.99 F1 on new data [1,3] [1]: Max Friedrich. Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text November 20, 2018 https://www.inf.uni- hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2018-ma-friedrich.pdf [2]: Amber Stubbs, Michele Filannino, and Ozlem Uzuner. De-identification of psychiatric ¨ intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics, 75(S):S4–S18, November 2017. ISSN 1532-0464. URL https://doi.org/10.1016/j.jbi.2017.06.011. [3]: Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. De-identification of patient notes with recurrent neural networks. Journal of the American Medical References Informatics Association, 24(3):596–606, 2017b. URL https://doi.org/10.1093/ jamia/ocw156.
  • 38. What is an Embedding? Text Space (e.g. English) Embedding Space (e.g. R300) Embedding Method (e.g. Word2Vec) 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 …
  • 39. What is an Embedding? Text Space (e.g. English) Embedding Space (e.g. R300) 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … Embedding Method (e.g. Word2Vec)
  • 41. What is an Embedding? • Text Space – Full Algorithmic Utility – Limited to No Guarantees on Privacy – Sparse, Inherently Brittle • Embedding Space – Very High Algorithmic Utility – Strong Privacy Guarantees – Dense, Generically More Generalizable
  • 42. Why are embeddings helpful? “This representation would still require humans to annotate PHI (as this is the training data for the task) but the pseudonymization step would be performed by the transformation to the representation. A tool trained on the representation could easily be made publicly available because its parameters cannot contain any protected data, as it is never trained on raw text.” - Max Friedrich (Data Protection Compliant Exchange of Training Data for Automatic De-Identification of Medical Text) November 20, 2018 https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed- theses/2018-ma-friedrich.pdf
  • 43. King Queen - man + woman (Royalty) How do Embeddings Work? • Meaning is “encoded” into the embedding space • Individual dimensions are not human interpretable • Embedding method learns by examining large corpora of generic language • Goal is accurate language representation as a proxy for downstream performance
  • 45. “Word” Embeddings Token Value “great” [0.1, 0.3, …] … … Examples In Practice • Word2vec • GloVe • fastText
  • 46. “Word” Embeddings Token Value “great” [0.1, 0.3, …] … … Examples In Practice Training The quick brown fox _____ over the lazy dog ___ ___ ____ ___ jumps ___ __ ___ ___ CBOW Skip Gram • Word2vec • GloVe • fastText
  • 47. Do They Really Preserve Algorithmic Value? 0.5 0.55 0.6 0.65 0.7 0.75 0.8 50 75 100 125 150 175 200 225 250 275 300 Word2vec Benchmark (IMDB Sentiment Analysis) tf-idf word2vec • Embeddings generally outperform raw text at low data volumes • Leveraging large, generic text corpora improves generalizability • This is 6 year old tech. Embeddings have improved drastically. Text has not. Reported numbers are the average of 5 runs of randomly sampled test/train splits each reporting the average of a 5-fold cv, within which Logistic Regression hyperparameters are optimized. Generated using Enso
  • 48. Do They Really Preserve Algorithmic Value? • Embeddings generally outperform raw text at low data volumes • Leveraging large, generic text corpora improves generalizability • This is 4 year old tech. Embeddings have improved drastically. Text has not. Reported numbers are the average of 5 runs of randomly sampled test/train splits each reporting the average of a 5-fold cv, within which Logistic Regression hyperparameters are optimized. Generated using Enso 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 425 450 475 500 Accuracy Number of Data Points Glove Benchmark (Rotten Tomatoes Sentiment Analysis) tf-idf Glove
  • 50. Text Embeddings Examples In Practice Often built on top of pre-trained word embeddings • Doc2vec • Elmo • ULMFiT
  • 51. Text Embeddings Examples In Practice Training The quick brown fox jumps over the lazy 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … Language Supervised dog True Often built on top of pre-trained word embeddings • Doc2vec • Elmo • ULMFiT
  • 52. Text Embeddings CNN-Style The quick brown fox jumps over the lazy 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … Prediction https://arxiv.org/pdf/1408.5882.pdf Example
  • 53. Text Embeddings RNN-Style The quick brown fox jumps over the lazy 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … Output Memory 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … 0.5 0.3 0.4 0.2 0.1 0.1 0.9 0.3 … 0.3 0.5 0.6 0.2 0.4 0.4 0.9 0.9 … 0.1 0.1 0.2 0.3 0.9 0.5 0.1 0.5 … 0.8 0.3 0.7 0.9 0.7 0.9 0.3 0.5 … 0.9 0.5 0.5 0.6 0.5 0.2 0.8 0.1 … 0.9 0.6 0.1 0.7 0.4 0.4 0.8 0.1 … 0.6 0.5 0.8 0.6 0.9 0.6 0.3 0.1 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … σ σ σ σ σ σ σ σ Prediction https://arxiv.org/pdf/1802.05365.pdf Example
  • 54. Additional Notes • “Word” Embeddings – May not correspond directly to words – Many excellent public implementations – Vulnerable to rainbow-table attacks (Do not store) • Text Embeddings – Often built on top of word embeddings – K-anonymity can be directly assessed – Naively implemented as mean of word embeddings
  • 55. Embedding Options • Public – Ready-to-use downloadable word vectors – Trained on public datasets (e.g. word2vec trained on CommonCrawl) • Homegrown – Embedding technique trained in-house – Typically a straightforward algorithm on a novel data corpus • Third Party – Sourced from vendor responsible for updating data sources and algorithms
  • 56. Embedding Options (Public) Notes Ease of Use Algorithmic Advantage Intrinsic Privacy Efficacy Transparency & Control Low Medium High • Public embeddings effectively mean public rainbow tables • Licensing is often ambiguous or incompatible with commercial use
  • 57. Embedding Options (Homegrown) Notes Ease of Use Algorithmic Advantage Intrinsic Privacy Efficacy Transparency & Control Low Medium High • Deceptively easy to start, extremely difficult to effectively use in production • While this can theoretically be a good approach, for most organizations this is a bad idea.
  • 58. Embedding Options (Third party) Notes Ease of Use Algorithmic Advantage Intrinsic Privacy Efficacy Transparency & Control Low Medium High • Data leakage can result in handing away competitive edge • Efficacy highly variable. Ensure robust benchmarks before working with any third party • Privacy-controlled environment makes involving a third party contractually difficult
  • 59. State of Embeddings in Spark • Training yourself – Reasonable word2vec implementations – Proper evaluation requires two nested CV loops + a private holdout at a minimum • Basic NLP Functionality – John Snow NLP Library (https://nlp.johnsnowlabs.com/) • Pretrained Embeddings – Compute offline in python (Gensim, Spacy, indico)
  • 60. Summary • Privacy-preserving computation is important • De-identification is easy with Spark • Beware centralized sanitization • Embeddings are great for privacy
  • 61. Interested in challenging data engineering, ML & AI on very big data? We’d love to hear from you. sim@swoop.com / slater@indico.io Privacy matters. Thank you for caring.