High-accuracy ML & AI
over sensitive data
Simeon Simeonov, Swoop
@simeons / sim at swoop dot com
omni-channel marketing for your ideal population
supported by privacy-preserving ML/AI
e.g., we improve health outcomes by increasing the
diagnosis rate of rare diseases through doctor/patient education
Swoop & IPM.ai data for 300+M people
• Anonymized patient data
• Online activity
• Imprecise location data
• Demographics, psychographics, purchase behavior, …
Privacy by design: HIPAA-compliant prAIvacy™ platform.
Trusted by the largest pharma companies. GDPR compliant.
Privacy-preserving computation frontiers
• Stochastic
– Differential privacy
• Encryption-based
– Fully homomorphic encryption
• Protocol-based
– Secure multi-party computation (SMC)
When privacy-preserving algorithms are immature,
sanitize the data the algorithms are trained on
Privacy concerns stem from identifiability
• Direct (via personally-identifiable information)
• Indirect (via quasi-identifiers)
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Addressing identifiability in a single dataset
• Direct
– Generate secure pseudonymous identifiers
– Often uses clean room to process PII
• Indirect
– Sanitize quasi-identifiers to desired anonymity trade-offs
– Control data enhancement to maintain anonymity
anonymity == indistinguishability
Sanitizing quasi-identifiers
• Deterministic
– Generalize or suppress quasi-identifiers
– k-anonymity + derivatives
• any given record is indistinguishable from at least k-1 other records
• Stochastic
– Add noise to data
– (k, ε)-anonymity
• Domain-specific
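A minimal plain-Scala sketch of the deterministic case: generalize quasi-identifiers, then measure the size of the smallest group of identical records. The `Person` fields, the decade bucketing, and the sample data are illustrative assumptions, not part of the talk.

```scala
object KAnonymity {
  // Hypothetical record holding three quasi-identifiers (field names are illustrative).
  final case class Person(gender: String, yob: Int, zip3: String)

  // Generalization step (an assumed rule): bucket year of birth into decades.
  def generalize(p: Person): Person = p.copy(yob = p.yob / 10 * 10)

  // Anonymity level = size of the smallest group of records that share
  // identical quasi-identifier values (0 for an empty dataset).
  def anonymityLevel(records: Seq[Person]): Int =
    if (records.isEmpty) 0
    else records.groupBy(identity).values.map(_.size).min

  // A dataset is k-anonymous when every record has at least k-1 look-alikes.
  def isKAnonymous(records: Seq[Person], k: Int): Boolean =
    anonymityLevel(records) >= k

  // Made-up sample data: every raw record is unique.
  val sample: Seq[Person] = Seq(
    Person("M", 1977, "021"), Person("M", 1975, "021"),
    Person("F", 1982, "021"), Person("F", 1988, "021")
  )
}
```

On the sample, the raw records have anonymity level 1, while `KAnonymity.sample.map(KAnonymity.generalize)` reaches level 2 at the cost of less precise birth years.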
Addressing identifiability across datasets
• Centralized approach
– Join all data + sanitize the whole
– Big increase in dimensionality
• Federated approach
– Keep data separate + sanitize operations across data
– Smallest possible increase in dimensionality
We show that when the data contains a large number of attributes which may be
considered quasi-identifiers, it becomes difficult to anonymize the data without an
unacceptably high amount of information loss. ... we are faced with ... either
completely suppressing most of the data or losing the desired level of anonymity.
On k-Anonymity and the Curse of Dimensionality
2005 Aggarwal, C. @ IBM T. J. Watson Research Center
Centralized sanitization hurts ML/AI accuracy
We find that for privacy budgets effective at preventing attacks,
patients would be exposed to increased risk of stroke,
bleeding events, and mortality.
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing
2014 Fredrikson, M. et al. @ UW Madison and Marshfield Clinic Research Foundation
Centralized sanitization increases risk
[Figure: k-anonymizing Titanic passenger survivability. Normalized Certainty Penalty (NCP, 0% to 40%) versus k (2 to 10), for the quasi-identifier sets "age" and "gender & age".]
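To make the NCP axis concrete, here is a small plain-Scala sketch. It assumes the common definition of NCP for a numeric attribute (width of the generalized interval divided by the attribute's full range) and a deliberately naive grouping rule (sort, then cut into runs of at least k values). Neither the grouping rule nor the sample ages come from the talk.

```scala
object Ncp {
  // NCP of one generalized numeric value: interval width over the attribute's full range.
  def ncp(lo: Double, hi: Double, domainMin: Double, domainMax: Double): Double =
    if (domainMax == domainMin) 0.0 else (hi - lo) / (domainMax - domainMin)

  // Naive grouping (an assumption of this sketch): consecutive runs of k sorted
  // values, with an undersized last run merged into the one before it.
  def groupsOfAtLeast(sorted: Vector[Double], k: Int): Vector[Vector[Double]] = {
    val chunks = sorted.grouped(k).toVector
    if (chunks.size > 1 && chunks.last.size < k)
      chunks.dropRight(2) :+ (chunks(chunks.size - 2) ++ chunks.last)
    else chunks
  }

  // Record-weighted average NCP after replacing each value by its group's [min, max].
  def averageNcp(values: Seq[Double], k: Int): Double = {
    require(values.nonEmpty && k >= 1)
    val sorted = values.sorted.toVector
    val total = groupsOfAtLeast(sorted, k)
      .map(g => g.size * ncp(g.head, g.last, sorted.head, sorted.last))
      .sum
    total / sorted.size
  }
}
```

With made-up ages `Seq(20.0, 22.0, 30.0, 34.0, 50.0, 60.0)`, the average NCP is 0 for k = 1 and grows as k increases, which is the trend the chart plots.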
Federated sanitization: Swoop’s prAIvacy™
• Secure, isolated data pools
• Automated sanitization
• Min dimensionality growth
• Deterministic + stochastic
• Optimal + often lossless
Model condition X
score on other data
Putting it all into practice (using Spark)
• Pre-process data
• Generate secure pseudonymous identifiers
• Sanitize quasi-identifiers
dirty quasi-identifiers increase distinguishability:
clean data before sanitization
to prevent increased sanitization loss
no anonymization framework for unstructured data:
suppress or structure
Word embedding for text anonymization
• Text ➞ high-dimensionality vector
– Capture semantics
“Texas” + “Milwaukee” – “Wisconsin” ≃ “Dallas”
– ML/AI-friendly representation
– word2vec, doc2vec, GloVe, …
• Anonymizing embeddings
– Train secret embeddings model
– Add noise to vectors
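The "add noise to vectors" step might look like the following plain-Scala sketch. Gaussian noise, the `sigma` parameter, and the cosine check are assumptions chosen for illustration; the talk does not say which noise distribution or scale is used, and this sketch makes no claim about the privacy level a given `sigma` provides.

```scala
import scala.util.Random

object EmbeddingNoise {
  // Perturb each coordinate with independent Gaussian noise of standard deviation sigma.
  def addGaussianNoise(v: Vector[Double], sigma: Double, rng: Random): Vector[Double] =
    v.map(x => x + sigma * rng.nextGaussian())

  // Cosine similarity, used here to see how much of the direction (the semantics) survives.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }
}
```

For example, `EmbeddingNoise.addGaussianNoise(Vector(0.2, -0.5, 0.9, 0.1), 0.05, new Random(42))` returns a vector of the same length that differs from the input but stays close to it in cosine terms. Larger `sigma` hides more and preserves less, so the scale has to be tuned against model accuracy.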
Secure pseudonymous ID generation
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Sim Simeonov; M; 1977-07-07
One Swoop Way, Suite 305, Cambridge, MA 02140
...
Sim|Simeonov|M|1977-07-07|02140 // consistent serialization
8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx)
gPGIoVw … wnNpij1LveZRtKeWU= // master encryption (AES-xxx)
Vw50jZjh6BCWUzSVu … mfUFtyGZ3q // partner A encryption
6ykWEv7A2lisz8KUi … VT2ZddaOeML // partner B encryption
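The chain on this slide (serialize, hash, master-encrypt, then re-encrypt per partner) can be sketched in plain Scala with JDK classes only. Several details are assumptions of this sketch, not the talk's implementation: keys are derived by truncating SHA-256 of a pass phrase to 128 bits (demo only), and encryption uses `AES/ECB/PKCS5Padding` so that equal inputs always yield equal ciphertexts, which is what lets IDs match across records.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object PseudonymousId {
  final case class Pii(firstName: String, lastName: String,
                       gender: String, dob: String, zip: String)

  // Consistent serialization: trim and upper-case names, join fields with a pipe.
  def serialize(p: Pii): String =
    Seq(p.firstName.trim.toUpperCase, p.lastName.trim.toUpperCase,
        p.gender, p.dob, p.zip).mkString("|")

  // Destructive (one-way) hashing.
  def sha256(data: Array[Byte]): Array[Byte] =
    MessageDigest.getInstance("SHA-256").digest(data)

  // Assumption (demo only): 128-bit AES key = first 16 bytes of SHA-256(pass phrase).
  def secret(passPhrase: String): SecretKeySpec =
    new SecretKeySpec(sha256(passPhrase.getBytes(UTF_8)).take(16), "AES")

  // Assumption: deterministic AES/ECB so the same input always maps to the same ID.
  def aesEncrypt(data: Array[Byte], key: SecretKeySpec): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, key)
    cipher.doFinal(data)
  }

  // Serialize -> hash -> encrypt with the master secret.
  def masterId(p: Pii, masterPassPhrase: String): Array[Byte] =
    aesEncrypt(sha256(serialize(p).getBytes(UTF_8)), secret(masterPassPhrase))

  // Re-encrypt the master ID with a partner-specific secret and Base64-encode it.
  def partnerId(master: Array[Byte], partnerPassPhrase: String): String =
    Base64.getEncoder.encodeToString(aesEncrypt(master, secret(partnerPassPhrase)))
}
```

Because each partner's ID is encrypted under a different secret, the same person gets a different ID at every partner, and only the holder of the master and partner secrets can map between them.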
Multiple IDs for dirty data
Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean
S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos
Sim|Simeonov|M|1977-07|02140 // also may reduce dob/geo accuracy
tune fuzzification to use cases & desired FP/FN rates
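The three serializations above can be reproduced with a short plain-Scala sketch. The `soundex` function here is a hand-written stand-in that follows the standard American Soundex rules; `Pii` and `idSerializations` are hypothetical names, and names are upper-cased as in the Spark rules later in the deck.

```scala
object FuzzyIds {
  // Standard Soundex consonant classes; vowels, H, W, and Y carry no code.
  private val codes: Map[Char, Char] = Seq(
    "BFPV" -> '1', "CGJKQSXZ" -> '2', "DT" -> '3', "L" -> '4', "MN" -> '5', "R" -> '6'
  ).flatMap { case (letters, digit) => letters.toSeq.map(c => c -> digit) }.toMap

  def soundex(name: String): String = {
    val letters = name.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) ""
    else {
      val out = new StringBuilder
      out.append(letters.head)
      var last = codes.getOrElse(letters.head, '0')
      letters.tail.foreach { c =>
        if (out.length < 4) {
          val code = codes.getOrElse(c, '0')
          if (code != '0') {
            if (code != last) out.append(code) // skip repeats of the same code
            last = code
          } else if (c != 'H' && c != 'W') last = '0' // vowels reset; H and W do not
        }
      }
      (out.toString + "000").take(4) // pad with zeros to four characters
    }
  }

  final case class Pii(firstName: String, lastName: String,
                       gender: String, dob: String, zip: String)

  // One serialization per matching rule: full entry, fuzzified names, reduced dob.
  def idSerializations(p: Pii): Seq[String] = Seq(
    Seq(p.firstName.toUpperCase, p.lastName.toUpperCase, p.gender, p.dob, p.zip),
    Seq(p.firstName.take(1).toUpperCase, soundex(p.lastName), p.gender, p.dob, p.zip),
    Seq(p.firstName.toUpperCase, p.lastName.toUpperCase, p.gender, p.dob.take(7), p.zip)
  ).map(_.mkString("|"))
}
```

A typo such as "Simeonof" still yields `S|S551|M|1977-07-07|02140` under the second rule, so the two records match on that ID. Looser rules raise the false-positive rate, which is the trade-off to tune.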
Build pseudonymous IDs with Spark
(and sanitize PII-based quasi-identifiers)
We need a few user-defined functions
• Strong secure hash function with very few collisions
– sha256(data) computes SHA-256
• Strong symmetric key encryption
– aes_encrypt(data, secret) in Hive but not ported to Spark
– aes__encrypt(data, secret) is a UDF to avoid name conflict
• Demo sugar to build secrets from pass phrases
– secret(pass_phrase)
Let’s create some PII
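import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // 'symbol column syntax and Dataset encoders
case class PII(firstName: String, lastName: String,
               gender: String, dob: String, zip: String)
val sim = PII("Sim", "Simeonov", "M", "1977-07-07", "02140")
val ids = spark.createDataset(Seq(sim))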
Consistent serialization
val p = lit("|") // just a pipe symbol to save us typing
lazy val idRules = Seq(
  // Rule 1: Use all PII
  concat(upper('firstName), p, upper('lastName), p, 'gender, p, 'dob, p, 'zip),
  // Rule 2: Use only first initial of first name and soundex of last name
  concat(upper('firstName.substr(1, 1)), p, soundex(upper('lastName)), p,
    'gender, p, 'dob, p, 'zip)
)
Hash & encrypt
// The pseudonymous ID columns built from the rules
lazy val psids = {
  val masterPassword = "Master Password" // master password to encrypt IDs with
  // Serialize -> Hash -> Encrypt
  idRules.zipWithIndex.map { case (serialization, idx) =>
    aes__encrypt(sha256(serialization), secret(lit(masterPassword)))
      .as(s"psid${idx + 1}")
  }
}
PII-based quasi-identifiers
// Generalization of quasi-identifying columns
lazy val quasiIdCols: Seq[Column] = Seq(
  'gender,
  'dob.substr(1, 4).cast(IntegerType).as("yob"), // only year of birth
  // only first 3 digits of zip; the integer cast drops leading zeros ("021" becomes 21)
  'zip.substr(1, 3).cast(IntegerType).as("zip3")
)
Generate master IDs
// Master pseudonymous IDs
lazy val masterIds = ids.select(quasiIdCols ++ psids: _*)
Generate per partner IDs
val partnerPasswords = Map("A" -> "A Password", "B" -> "B Password")
val partnerIds = spark.createDataset(partnerPasswords.toSeq)
  .toDF("partner_name", "pwd").withColumn("pwd", secret('pwd))
  .crossJoin(masterIds)
  .transform { df =>
    psids.indices.foldLeft(df) { case (current, idx) =>
      val colName = s"psid${idx + 1}"
      current.withColumn(colName, base64(aes__encrypt(col(colName), 'pwd)))
    }
  }
  .drop("pwd")
The end result
Sanitizing quasi-identifiers in Spark
• Optimal k-anonymity is an NP-hard problem
– Mondrian algorithm: greedy O(n log n) approximation
• https://github.com/eubr-bigsea/k-anonymity-mondrian
• Active research
– Locality-sensitive hashing (LSH) improvements
– Risk-based approaches (e.g., LBS algorithm)
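The greedy idea behind Mondrian can be sketched in a few lines of plain Scala: pick the quasi-identifier with the widest range, cut at the median, and recurse while both halves keep at least k records. This is a simplified illustration, not the linked project's code. It compares raw ranges within each partition rather than normalized ones, re-sorts at every level, and handles numeric attributes only.

```scala
object Mondrian {
  type Record = Vector[Double] // one numeric value per quasi-identifier

  // Recursively split until no cut leaves at least k records on both sides.
  def partition(records: Vector[Record], k: Int): Vector[Vector[Record]] = {
    require(k >= 1 && records.nonEmpty)
    def range(d: Int): Double = records.map(_(d)).max - records.map(_(d)).min
    // Try dimensions from widest to narrowest range (simplification: raw ranges).
    val candidates = records.head.indices.sortBy(d => -range(d))
    val cut = candidates.iterator.map { d =>
      val sorted = records.sortBy(_(d))
      val median = sorted(sorted.size / 2)(d)
      sorted.partition(_(d) < median)
    }.find { case (left, right) => left.size >= k && right.size >= k }
    cut match {
      case Some((left, right)) => partition(left, k) ++ partition(right, k)
      case None                => Vector(records) // no allowable cut: final group
    }
  }

  // Generalize a group to one [min, max] interval per quasi-identifier.
  def generalize(group: Vector[Record]): Vector[(Double, Double)] =
    group.head.indices.toVector.map(d => (group.map(_(d)).min, group.map(_(d)).max))
}
```

Every returned group has at least k records by construction, so replacing each record with its group's intervals yields a k-anonymous table over these attributes.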
Interested in challenging data engineering, ML & AI on petabytes of data?
I’d love to hear from you. @simeons / sim at swoop dot com
https://databricks.com/session/great-models-with-great-privacy-optimizing-ml-ai-under-gdpr
https://databricks.com/session/the-smart-data-warehouse-goal-based-data-production
https://swoop-inc.github.io/spark-records/
Privacy matters. Thank you for caring.
