AI-based re-identification exposes privacy risk of behavioral data. A case for synthetic data
Michael Platzer, MOSTLY AI
Thomas Reutterer, Vienna University of Economics and Business
Stefan Vamosi, Vienna University of Economics and Business
May 2021
This work is supported by the “ICT of the Future” funding programme of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.
PAGE 2
The Re-Identification of Netflix Data
Paper on Re-Identification on 2006-10-18
● fuzzy linkage attack
● leveraged public IMDB data as auxiliary data
Netflix releases “anonymized” data on 2006-10-02
● 470K users, 18K movies, 100M ratings
● only subset of customer base
● no customer information
● some random noise to dates and ratings
Aftermath
1. class action lawsuit against Netflix → undisclosed settlement
2. hardly any public sharing of behavioral data
3. privacy regulations adapted to linkage attacks
PAGE 3
GDPR
https://gdpr-info.eu/issues/personal-data/
Personal data are any information which are related to an identified or identifiable natural person.
Recital 30: Natural persons may be associated with online identifiers provided by their devices [..] This may leave traces which [..] may be used to create profiles of the natural persons and identify them.
Recital 26: To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
PAGE 4
AI-Based Re-Identification of Faces
[Figure: an unlabeled query face ("?") next to a gallery of labeled faces: George Clooney, Arnold Schwarzenegger, Sylvester Stallone, ...]
PAGE 5
Learning Traits of Faces with Triplet Loss
● Anchor: Subject A
● Positive Sample: Subject A
● Negative Sample: Subject B
Train a deep neural network to discriminate triplets of faces. This task yields an embedding space that represents the characteristic traits of each face (Schroff et al., 2015).
Example embedding: -0.23 0.39 0.92 -0.02 0.05 ... -0.24
→ Re-identification is then done via nearest-neighbor search in that embedding space.
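The triplet objective described above can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distances and a margin of 0.2; the function names and the toy 2-d vectors are hypothetical and not taken from the slides, and a real model would compute the loss on network outputs during training.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.2) -> float:
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the negative sample is at least `margin`
    farther from the anchor than the positive sample.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Toy 2-d embeddings (made-up numbers, for illustration only).
anchor = [0.0, 0.0]         # Subject A
positive = [0.1, 0.0]       # Subject A again
far_negative = [1.0, 0.0]   # Subject B, already well separated
near_negative = [0.2, 0.0]  # Subject B, too close to the anchor

easy = triplet_loss(anchor, positive, far_negative)   # no penalty
hard = triplet_loss(anchor, positive, near_negative)  # positive penalty
```

Triplets like the second one are the ones that push the network to move Subject B away from Subject A in the embedding space.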
PAGE 6
Learning Traits of Behavior with TL-RNN
● Anchor: Subject #123
● Positive Sample: Subject #123
● Negative Sample: Subject #789
Train a deep neural network to discriminate triplets of behavioral data. This task yields an embedding space that represents the characteristic traits of users (Vamosi et al., forthcoming).
Example embedding: -0.23 0.39 0.92 -0.02 0.05 ... -0.24
→ Re-identification is then done via nearest-neighbor search in that embedding space.
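The nearest-neighbor step can be illustrated as follows. This is a sketch only, assuming Euclidean distance and precomputed embeddings; the helper `rank_candidates`, the subject ids and the 2-d vectors are hypothetical stand-ins for the output of a trained model.

```python
import math
from typing import Dict, List, Sequence


def rank_candidates(query: Sequence[float],
                    released: Dict[str, Sequence[float]]) -> List[str]:
    """Return released subject ids ordered from nearest to farthest."""
    return sorted(released, key=lambda sid: math.dist(query, released[sid]))


# Made-up embeddings of subjects in the released "anonymous" data.
released = {"#123": [0.1, 0.9], "#789": [0.8, 0.2], "#456": [0.5, 0.5]}

# Made-up embedding of the attacker's auxiliary data on an unknown subject.
query = [0.2, 0.8]

ranking = rank_candidates(query, released)
best_match = ranking[0]  # the attacker's top candidate
```

The full ranking, not just the best match, is what allows reporting both top-1 and top-5 hit rates.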
PAGE 7
Re-Identification Study
Attack Scenario
1. An organization releases an “anonymous” behavioral dataset D for period P1
2. An attacker obtains auxiliary behavioral data on user X for period P2
3. The attacker then attempts to re-identify user X within D via TL-RNN embeddings trained on D
→ The attacker reveals the activities of user X within D
Linkage Attacks rely on an overlap of the data points of the released and the auxiliary data. A fuzzy match on these data points allows for re-identification.
[Diagram: “Anonymous” Released Data (= Netflix) and Identified Auxiliary Data (IMDB)]
Pattern Attacks do not require an overlap of the data points, but merely of the data subjects of the released and the auxiliary data. A fuzzy match on the behavioral patterns of these data points allows for re-identification.
[Diagram: “Anonymous” Released Data and Identified Auxiliary Data]
Dataset
● Comscore Web Browser Panel = continuous tracking of browsing behavior
● Release January to June data for 4,000 active “anonymous” panelists
● Attempt to re-identify panelists based on their observed July data → no overlap in data points
Subject #? Jul
● google.com
● google.com
● booking.com
● kayak.com
● cnn.com
● weather.com
● ...
PAGE 8
Re-Identification Study
Subject #1 Jan-Jun
● expedia.com
● kayak.com
● google.com
● kayak.com
● ups.com
● usatoday.com
● ...
Subject #4000 Jan-Jun
● weather.com
● google.com
● usatoday.com
● google.com
● aol.com
● google.com
● ...
Arnold Schwarzenegger
Research Questions
1. Can we re-identify via pattern attack?
2. Can we protect with data perturbation?
3. Can we protect with data synthesis?
PAGE 9
Re-Identification Study
PAGE 10
Re-Identification Study
0.025% (= 1/4000) are re-identified via random guess
PAGE 11
Re-Identification Study
49.9% are re-identified via TL-RNN based Pattern Attack
→ Re-identification on behavioral traits is possible
PAGE 12
Re-Identification Study
65.6% are within the Top 5 candidates via TL-RNN based Pattern Attack
→ Re-identification on behavioral traits is possible
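To make the reported metrics concrete, the sketch below shows how a top-k hit rate and the random-guess baseline could be computed. The function names and the toy rank list are hypothetical; only the 4,000 subject count comes from the slides.

```python
from typing import Sequence


def top_k_rate(true_ranks: Sequence[int], k: int) -> float:
    """Share of attacked subjects whose true identity is ranked
    within the top k candidates (rank 1 = nearest neighbor)."""
    return sum(1 for rank in true_ranks if rank <= k) / len(true_ranks)


def random_guess_rate(n_subjects: int, k: int = 1) -> float:
    """Chance that k uniformly random guesses contain the true subject."""
    return k / n_subjects


# Made-up ranks of the true subject for eight attacked users.
toy_ranks = [1, 3, 1, 12, 5, 1, 40, 2]

top1 = top_k_rate(toy_ranks, 1)
top5 = top_k_rate(toy_ranks, 5)
baseline = random_guess_rate(4000)  # 1/4000, i.e. 0.025%
```

Against a baseline of 1/4000, a top-1 rate of 49.9% corresponds to roughly a 2,000-fold improvement over guessing.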
PAGE 13
Re-Identification Study - Perturbation
26.6% are re-identified despite 30% of data points being replaced*
→ Re-identification is robust against noise
*We replaced each data point, with 30% probability, with a data point from another subject. This is a highly destructive mechanism applied in an attempt to prevent re-identification.
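The perturbation mechanism from the footnote can be sketched as below. The slides do not say how the replacement point is drawn, so picking a random other subject and then a random point of that subject is an assumption, as are the function name `perturb` and the toy data.

```python
import random
from typing import Dict, List


def perturb(sequences: Dict[str, List[str]],
            p: float,
            rng: random.Random) -> Dict[str, List[str]]:
    """Replace each data point with probability p by a data point
    drawn from another subject (sampling scheme is an assumption)."""
    perturbed = {}
    for sid, points in sequences.items():
        # Candidate donors: every other subject that has data.
        donors = [d for d in sequences if d != sid and sequences[d]]
        new_points = []
        for point in points:
            if donors and rng.random() < p:
                donor = rng.choice(donors)
                new_points.append(rng.choice(sequences[donor]))
            else:
                new_points.append(point)
        perturbed[sid] = new_points
    return perturbed


# Toy browsing sequences (made-up).
data = {
    "#1": ["expedia.com", "kayak.com", "google.com"],
    "#2": ["weather.com", "aol.com"],
}
noisy = perturb(data, p=0.3, rng=random.Random(0))
```

Sequence lengths are preserved, so only the content of a subject's history changes, not its volume.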
PAGE 14
Re-Identification Study - Perturbation
1.1% are re-identified despite 60% of data points being replaced*
→ Re-identification is robust against noise
*We replaced each data point, with 60% probability, with a data point from another subject. This is a highly destructive mechanism applied in an attempt to prevent re-identification.
Idea: Release Synthetic Data rather than (perturbed) Original Data
● Generative AI can provide representative datasets that strive to retain statistical properties
● no 1:1 link between actual and synthetic subjects → no re-identification possible
● BUT a generative model might still leak information on individuals through memorization
PAGE 15
Re-Identification Study - Synthetization
How to Test Privacy of Synthetic Data?
● “Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data” (Platzer et al., forthcoming)
● Require that synthetic subjects are NOT systematically closer to training subjects than to holdout subjects
Empirical Privacy Test
● Split 4,000 subjects into 2,000 training and 2,000 holdout subjects
● Generate 2,000 synthetic subjects based on the 2,000 training subjects
● Check whether synthetic subjects are any closer to training than to holdout subjects, based on TL-RNN embeddings
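The check in the last bullet can be sketched as follows. This is an illustration under assumptions: each synthetic subject is compared by its Euclidean distance to the nearest training and the nearest holdout embedding, and ties count as not closer to training. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def nearest_distance(point: Vector, pool: Sequence[Vector]) -> float:
    """Distance from `point` to its closest neighbor in `pool`."""
    return min(math.dist(point, other) for other in pool)


def closer_to_training_share(synthetic: Sequence[Vector],
                             training: Sequence[Vector],
                             holdout: Sequence[Vector]) -> float:
    """Share of synthetic subjects that lie strictly closer to the
    training set than to the holdout set."""
    closer = sum(
        1 for s in synthetic
        if nearest_distance(s, training) < nearest_distance(s, holdout)
    )
    return closer / len(synthetic)


# Made-up 2-d embeddings.
training = [[0.0, 0.0], [0.2, 0.1]]
holdout = [[1.0, 1.0], [0.9, 0.8]]
synthetic = [[0.1, 0.1], [0.9, 0.9], [0.3, 0.2], [0.8, 0.9]]

share = closer_to_training_share(synthetic, training, holdout)
```

A share near 50% is what one would expect if the generator has not memorized its training subjects; a share well above 50% would hint at leakage.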
PAGE 16
Re-Identification Study - Synthetization
Results
● Avg Distance to Training: 0.731
● Avg Distance to Holdout: 0.737
● 51.8% are closer to training, 48.2% are closer to holdout
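One way to read the 51.8% figure is to compare it with the 50% expected when synthetic subjects are equally close to both sets. The sketch below does this with a normal approximation; it assumes the 2,000 synthetic subjects can be treated as independent draws, which the slides do not state, and the function name is hypothetical.

```python
import math
from typing import Tuple


def share_z_test(share: float, n: int, expected: float = 0.5) -> Tuple[float, float]:
    """z-score and two-sided p-value for an observed share of n
    independent trials against an expected share (normal approximation)."""
    standard_error = math.sqrt(expected * (1.0 - expected) / n)
    z = (share - expected) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# 51.8% of 2,000 synthetic subjects are closer to training (from the slide).
z, p_value = share_z_test(0.518, 2000)
```

Under these assumptions the observed share lies about 1.6 standard errors above 50%, which is not significant at the conventional 5% level.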
PAGE 17
Summary
● Sharing of behavioral data is subject to GDPR
● AI-based re-identification on behavioral traits is possible
● Data perturbation does not protect your privacy
● Data synthesis can offer true anonymization

  • 2. The Re-Identification of Netflix Data. Netflix releases “anonymized” data on 2006-10-02 ● 470K users, 18K movies, 100M ratings ● only a subset of the customer base ● no customer information ● some random noise added to dates and ratings. Paper on re-identification published on 2006-10-18 ● fuzzy linkage attack ● leveraged public IMDB data as auxiliary data. Aftermath: 1. class action lawsuit against Netflix → undisclosed settlement 2. hardly any public sharing of behavioral data 3. privacy regulations adapted to linkage attacks
  • 3. GDPR (https://gdpr-info.eu/issues/personal-data/). Personal data are any information which are related to an identified or identifiable natural person. Recital 30: Natural persons may be associated with online identifiers provided by their devices [..] This may leave traces which [..] may be used to create profiles of the natural persons and identify them. Recital 26: To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
  • 4. AI-Based Re-Identification of Faces. [Figure labels: George Clooney, Arnold Schwarzenegger, Sylvester Stallone, ..., and one face marked “?”]
  • 5. Learning Traits of Faces with Triplet Loss. Train Deep Neural Network to discriminate triplets of faces. This task will yield an embedding space representing the characteristic traits. (Schroff et al, 2015) → Re-identification is then done via Nearest-Neighbor Search in that embedding space. [Figure: triplet of Anchor (Subject A), Positive Sample (Subject A), Negative Sample (Subject B); example embedding vector: -0.23 0.39 0.92 -0.02 0.05 ... -0.24]
  • 6. Learning Traits of Behavior with TL-RNN. Train Deep Neural Network to discriminate triplets of behavioral data. This task will yield an embedding space representing the characteristic traits of users. (Vamosi et al, forthcoming) → Re-identification is then done via Nearest-Neighbor Search in that embedding space. [Figure: triplet of Anchor (Subject #123), Positive Sample (Subject #123), Negative Sample (Subject #789); example embedding vector: -0.23 0.39 0.92 -0.02 0.05 ... -0.24]
  • 7. Re-Identification Study. Attack Scenario: 1. Organization releases a behavioral “anonymous” dataset D for period P1 2. Attacker obtains auxiliary behavioral data on user X for period P2 3. Attacker then attempts to re-identify user X within D via TL-RNN embeddings trained on D → Attacker reveals activities of user X within D. Linkage Attacks (example: “anonymous” released data = Netflix, identified auxiliary data = IMDB) rely on an overlap of the data points of the released and the auxiliary data. A fuzzy match on these data points allows for re-identification. Pattern Attacks do not require an overlap of the data points, but merely of the data subjects of the released and the auxiliary data. A fuzzy match on the behavioral patterns of these data points allows for re-identification.
  • 8. Re-Identification Study. Dataset ● Comscore Web Browser Panel = continuous tracking of browsing behavior ● Release January to June data for 4,000 active “anonymous” panelists ● Attempt to re-identify panelists based on their observed July data → no overlap in data points. [Figure: released data, Subject #1 Jan-Jun: expedia.com, kayak.com, google.com, kayak.com, ups.com, usatoday.com, ...; Subject #4000 Jan-Jun: weather.com, google.com, usatoday.com, google.com, aol.com, google.com, ...; auxiliary data, Subject #? Jul: google.com, google.com, booking.com, kayak.com, cnn.com, weather.com, ...; figure label: Arnold Schwarzenegger]
  • 9. Re-Identification Study. Research Questions: 1. Can we re-identify via pattern attack? 2. Can we protect with data perturbation? 3. Can we protect with data synthesis?
  • 10. Re-Identification Study: 0.025% (=1/4000) are re-identified via random guess
  • 11. Re-Identification Study: 49.9% are re-identified via TL-RNN based Pattern Attack → Re-Identification on behavioral traits possible
  • 12. Re-Identification Study: 65.6% are within Top 5 candidates via TL-RNN based Pattern Attack → Re-Identification on behavioral traits possible
  • 13. Re-Identification Study - Perturbation: 26.6% are re-identified despite 30% of data points being replaced* → Re-Identification is robust against Noise. *Each data point was replaced, with 30% probability, by a data point from any other subject: a highly destructive mechanism in an attempt to prevent re-identification.
  • 14. Re-Identification Study - Perturbation: 1.1% are re-identified despite 60% of data points being replaced* → Re-Identification is robust against Noise. *Each data point was replaced, with 60% probability, by a data point from any other subject: a highly destructive mechanism in an attempt to prevent re-identification.
  • 15. Re-Identification Study - Synthetization. Idea: Release Synthetic Data rather than (perturbed) Original Data ● Generative AI can provide representative datasets that strive to retain statistical properties ● no 1:1 link between actual and synthetic subjects → thus no re-identification possible ● BUT one still might leak information on individuals by memorization. How to Test Privacy of Synthetic Data? ● “Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data” (Platzer et al., forthcoming) ● Require that synthetic subjects are NOT systematically closer to training subjects than to holdout subjects
  • 16. Re-Identification Study - Synthetization. Empirical Privacy Test ● Split 4,000 subjects into 2,000 training and 2,000 holdout ● Generate 2,000 synthetic subjects based on 2,000 training ● Check whether synthetic are any closer to training than to holdout based on TL-RNN embeddings. Results ● Avg Distance to Training: 0.731 ● Avg Distance to Holdout: 0.737 ● 51.8% are closer to training - 48.2% are closer to holdout
  • 17. Summary ● Sharing of behavioral data is subject to GDPR ● AI-based re-identification on behavioral traits is possible ● Data perturbation does not protect your privacy ● Data synthesis can offer true anonymization
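
Slides 5, 6 and 10 to 12 describe training with a triplet loss and then re-identifying subjects by nearest-neighbor search in the learned embedding space, reported as top-1 and top-5 hit rates. The following is a minimal sketch of that evaluation step only, not the authors' TL-RNN: the embeddings are hand-written toy vectors, the function names are hypothetical, and the margin value is an illustrative assumption.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the negative sample is farther
    from the anchor than the positive sample by at least `margin`.
    The margin value here is an assumption for illustration."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def rank_candidates(query, released):
    """Sort released subject ids by distance to the query embedding."""
    return sorted(released, key=lambda sid: sq_dist(query, released[sid]))


def top_k_rate(queries, released, k=1):
    """Share of auxiliary subjects whose true id is among the k nearest
    released embeddings (k=1 is exact re-identification)."""
    hits = sum(
        1 for sid, emb in queries.items()
        if sid in rank_candidates(emb, released)[:k]
    )
    return hits / len(queries)


# Toy embeddings: `released` stands in for period P1, `queries` for period P2.
released = {"s1": [0.0, 0.0], "s2": [1.0, 0.0], "s3": [0.0, 1.0]}
queries = {"s1": [0.1, 0.0], "s2": [0.9, 0.1], "s3": [0.4, 0.45]}

top1 = top_k_rate(queries, released, k=1)
top2 = top_k_rate(queries, released, k=2)
```

With N released subjects, a random guess gives a top-1 rate of 1/N, which is the 1/4000 = 0.025% baseline on slide 10.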
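
The perturbation on slides 13 and 14 replaces each data point, with a fixed probability, by a data point from another subject. A minimal sketch of one way to implement this is shown below. The slides do not say how the donor subject or donor data point is chosen, so uniform sampling is an assumption here, and the function name and toy sequences are hypothetical.

```python
import random


def perturb(sequences, p, rng):
    """Return a copy of `sequences` (subject id -> list of data points) in
    which each data point is replaced with probability `p` by a data point
    drawn from a different subject. Uniform donor sampling is an assumption."""
    perturbed = {}
    for sid, seq in sequences.items():
        # Donors are all other subjects that have at least one data point.
        donors = [s for s in sequences if s != sid and sequences[s]]
        new_seq = []
        for point in seq:
            if donors and rng.random() < p:
                donor = rng.choice(donors)
                new_seq.append(rng.choice(sequences[donor]))
            else:
                new_seq.append(point)
        perturbed[sid] = new_seq
    return perturbed


# Toy browsing sequences with disjoint vocabularies per subject.
data = {
    "subject_1": ["a.com", "b.com", "a.com", "c.com"],
    "subject_2": ["x.com", "y.com", "x.com"],
}
rng = random.Random(0)
unchanged = perturb(data, 0.0, rng)
fully_replaced = perturb(data, 1.0, rng)
partly_replaced = perturb(data, 0.3, rng)
```

Sequence lengths are preserved, so the mechanism changes which sites appear but not how active each subject is.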
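
The empirical privacy test on slides 15 and 16 checks whether synthetic subjects are systematically closer to training subjects than to holdout subjects. The sketch below assumes that "distance" means the Euclidean distance from each synthetic embedding to its nearest training or holdout embedding, and that ties count half to each side; neither detail is stated on the slides. The function names and toy vectors are hypothetical.

```python
import math


def nearest_distance(x, pool):
    """Euclidean distance from x to its closest vector in pool."""
    return min(math.dist(x, y) for y in pool)


def holdout_privacy_test(synthetic, training, holdout):
    """Compare each synthetic embedding's nearest-neighbor distance to the
    training set against its nearest-neighbor distance to the holdout set."""
    d_train = [nearest_distance(s, training) for s in synthetic]
    d_hold = [nearest_distance(s, holdout) for s in synthetic]
    # Ties are counted as half (an assumption).
    closer = sum(
        1.0 if t < h else 0.5 if t == h else 0.0
        for t, h in zip(d_train, d_hold)
    )
    n = len(synthetic)
    return {
        "avg_dist_training": sum(d_train) / n,
        "avg_dist_holdout": sum(d_hold) / n,
        "share_closer_to_training": closer / n,
    }


# Toy embeddings standing in for TL-RNN outputs.
training = [[0.0, 0.0], [2.0, 0.0]]
holdout = [[0.0, 1.0], [2.0, 1.0]]
synthetic = [[0.0, 0.2], [2.0, 0.8]]

result = holdout_privacy_test(synthetic, training, holdout)
```

A share close to 50% and similar average distances, as in the reported 51.8% versus 48.2%, indicate that the generator did not memorize its training subjects more than it resembles unseen ones.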