AI-based re-identification exposes privacy risk of behavioral data. A case for synthetic data
Michael Platzer, MOSTLY AI
Thomas Reutterer, Vienna University of Economics and Business
Stefan Vamosi, Vienna University of Economics and Business
May 2021
This work is supported by the “ICT of the Future” funding programme of the Austrian Federal
Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.
PAGE 2
The Re-Identification of Netflix Data
Netflix releases “anonymized” data on 2006-10-02
● 480K users, 18K movies, 100M ratings
● only a subset of the customer base
● no customer information
● some random noise added to dates and ratings
Paper on re-identification published on 2006-10-18
● fuzzy linkage attack
● leveraged public IMDb data as auxiliary data
Aftermath
1. class action lawsuit against Netflix → undisclosed settlement
2. hardly any public sharing of behavioral data
3. privacy regulations adapted to linkage attacks
PAGE 3
GDPR
https://gdpr-info.eu/issues/personal-data/
Personal data are any information which are related to an identified or identifiable natural person.
Recital 30: Natural persons may be associated with online identifiers provided by their devices [..] This may leave traces which [..] may be used to create profiles of the natural persons and identify them.
Recital 26: To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
PAGE 4
AI-Based Re-Identification of Faces
[Figure: an unknown face (“?”) is matched against labeled faces: George Clooney, Arnold Schwarzenegger, Sylvester Stallone, ...]
PAGE 5
Learning Traits of Faces with Triplet Loss
Anchor: Subject A
Positive Sample: Subject A
Negative Sample: Subject B
Train a deep neural network to discriminate triplets of faces. This task yields an embedding space representing the characteristic traits (Schroff et al., 2015).
-0.23 0.39 0.92 -0.02 0.05 ... -0.24
→ Re-identification is then done via nearest-neighbor search in that embedding space.
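To make the triplet objective concrete, here is a minimal sketch of a triplet loss on squared Euclidean distances, in the spirit of Schroff et al. (2015). The margin of 0.2, the function name and the toy 2-dimensional embeddings are assumptions for illustration only; the slides do not state which margin or distance was used.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss, averaged over a batch of embeddings.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed value).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor vs. same subject
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor vs. other subject
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy embeddings (made up): Subject A twice, Subject B once.
anchor = np.array([[0.0, 1.0]])    # Subject A
positive = np.array([[0.1, 0.9]])  # Subject A, different sample
negative = np.array([[1.0, 0.0]])  # Subject B

loss_good = triplet_loss(anchor, positive, negative)  # well separated triplet
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
```

A well separated triplet contributes no loss, while the deliberately swapped triplet is penalized; training pushes the network toward the first situation.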
PAGE 6
Learning Traits of Behavior with TL-RNN
Anchor: Subject #123
Positive Sample: Subject #123
Negative Sample: Subject #789
Train a deep neural network to discriminate triplets of behavioral data. This task yields an embedding space representing the characteristic traits of users (Vamosi et al., forthcoming).
-0.23 0.39 0.92 -0.02 0.05 ... -0.24
→ Re-identification is then done via nearest-neighbor search in that embedding space.
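The nearest-neighbor step can be sketched as follows. This is not the TL-RNN itself: the embeddings are assumed to be given already, and the 3-dimensional vectors, subject labels and function name are hypothetical.

```python
import numpy as np

def nearest_subject(query, gallery, subject_ids):
    """Return the released subject whose embedding is closest to `query`."""
    dists = np.linalg.norm(gallery - query, axis=1)  # Euclidean distance per subject
    best = int(np.argmin(dists))
    return subject_ids[best], float(dists[best])

# Hypothetical embeddings of three released "anonymous" subjects.
gallery = np.array([
    [-0.23, 0.39, 0.92],
    [0.80, -0.10, 0.05],
    [0.10, 0.95, -0.30],
])
subject_ids = ["#123", "#456", "#789"]

# Hypothetical embedding of the auxiliary data on user X.
query = np.array([-0.20, 0.35, 0.90])

match, dist = nearest_subject(query, gallery, subject_ids)
```

With a few thousand subjects, a brute-force scan like this is sufficient; an approximate index would only matter at much larger scale.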
PAGE 7
Re-Identification Study
Attack Scenario
1. An organization releases an “anonymous” behavioral dataset D for period P1
2. An attacker obtains auxiliary behavioral data on user X for period P2
3. The attacker then attempts to re-identify user X within D via TL-RNN embeddings trained on D
→ The attacker reveals the activities of user X within D
Linkage Attacks rely on an overlap of the data points of the released and the auxiliary data. A fuzzy match on these data points allows for re-identification.
[Diagram labels: “Anonymous” Released Data (= Netflix); Identified Auxiliary Data (IMDb)]
Pattern Attacks do not require an overlap of the data points, but merely of the data subjects of the released and the auxiliary data. A fuzzy match on the behavioral patterns of these data points allows for re-identification.
[Diagram labels: “Anonymous” Released Data; Identified Auxiliary Data]
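As an illustration of the linkage idea, the sketch below scores each released subject by how many auxiliary records have an approximately matching released record. The record layout (item, rating, date), the tolerances and all names are assumptions made for this example, not the procedure of the original Netflix study.

```python
from datetime import date

def fuzzy_overlap(aux, released, max_days=14, max_rating_gap=1):
    """Count auxiliary records that approximately match a released record."""
    hits = 0
    for item, rating, day in aux:
        for r_item, r_rating, r_day in released:
            if (item == r_item
                    and abs(rating - r_rating) <= max_rating_gap
                    and abs((day - r_day).days) <= max_days):
                hits += 1
                break  # count each auxiliary record at most once
    return hits

def link(aux, released_by_subject, **tolerances):
    """Pick the released subject with the largest fuzzy overlap."""
    scores = {sid: fuzzy_overlap(aux, recs, **tolerances)
              for sid, recs in released_by_subject.items()}
    return max(scores, key=scores.get), scores

# Made-up released data with noisy ratings and dates.
released_by_subject = {
    "u1": [("movie_a", 4, date(2005, 3, 1)), ("movie_b", 2, date(2005, 3, 9))],
    "u2": [("movie_a", 1, date(2005, 6, 1)), ("movie_c", 5, date(2005, 6, 2))],
}
# Made-up identified auxiliary data on one person.
aux = [("movie_a", 5, date(2005, 3, 4)), ("movie_b", 2, date(2005, 3, 10))]

best, scores = link(aux, released_by_subject)
```

If the auxiliary data covers a different period than the released data, no records match and every score is zero. That is the situation a pattern attack is designed for.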
Dataset
● Comscore Web Browser Panel = continuous tracking of browsing behavior
● Release January to June data for 4,000 active “anonymous” panelists
● Attempt to re-identify panelists based on their observed July data → no overlap in data points
Subject #? Jul
● google.com
● google.com
● booking.com
● kayak.com
● cnn.com
● weather.com
● ...
PAGE 8
Re-Identification Study
Subject #1 Jan-Jun
● expedia.com
● kayak.com
● google.com
● kayak.com
● ups.com
● usatoday.com
● ...
Subject #4000 Jan-Jun
● weather.com
● google.com
● usatoday.com
● google.com
● aol.com
● google.com
● ...
Research Questions
1. Can we re-identify via pattern attack?
2. Can we protect with data perturbation?
3. Can we protect with data synthesis?
PAGE 9
Re-Identification Study
PAGE 10
Re-Identification Study
0.025% (=1/4000) are re-identified
via random guess
PAGE 11
Re-Identification Study
49.9% are re-identified
via TL-RNN based Pattern Attack
→ Re-Identification on behavioral traits possible
PAGE 12
Re-Identification Study
65.6% are within Top 5 candidates
via TL-RNN based Pattern Attack
→ Re-Identification on behavioral traits possible
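The top-1 and top-5 figures are hit rates over a ranking of candidates. A minimal sketch of how such rates could be computed from a distance matrix is shown below; the 4×4 matrix is made up, and the function name is an assumption.

```python
import numpy as np

def top_k_hit_rate(dist, true_idx, k):
    """Share of queries whose true subject is among the k nearest candidates.

    dist[i, j] is the distance from query i to released subject j.
    """
    ranked = np.argsort(dist, axis=1)[:, :k]          # k closest candidates per query
    hits = (ranked == true_idx[:, None]).any(axis=1)  # true subject among them?
    return float(hits.mean())

# Made-up distances for 4 queries against 4 released subjects.
dist = np.array([
    [0.1, 0.5, 0.9, 0.7],
    [0.6, 0.4, 0.2, 0.8],
    [0.3, 0.9, 0.5, 0.1],
    [0.9, 0.5, 0.3, 0.1],
])
true_idx = np.array([0, 1, 2, 3])  # query i truly belongs to subject i

top1 = top_k_hit_rate(dist, true_idx, k=1)
top2 = top_k_hit_rate(dist, true_idx, k=2)

# Random-guess baseline for the study's 4,000 panelists.
baseline = 1 / 4000  # = 0.025%
```

Against the 0.025% baseline, the reported 49.9% top-1 rate corresponds to roughly a 2,000-fold improvement over guessing.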
PAGE 13
Re-Identification Study - Perturbation
26.6% are re-identified despite 30% of data points being replaced*
→ Re-identification is robust against noise
*We replaced each data point, with 30% probability, with a data point from another subject. This is a highly destructive mechanism, applied in an attempt to prevent re-identification.
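The perturbation mechanism described in the footnote can be sketched as below. How the donor subject and the donor data point were sampled is not stated on the slide, so uniform sampling, the fixed seed and the function name are assumptions.

```python
import random

def perturb(sequences, p, seed=0):
    """Replace each data point, with probability p, by a data point
    drawn from a randomly chosen other subject (uniform sampling assumed)."""
    rng = random.Random(seed)
    out = {}
    for sid, seq in sequences.items():
        # Possible donors: every other subject with at least one data point.
        donors = [s for s in sequences if s != sid and sequences[s]]
        new_seq = []
        for point in seq:
            if donors and rng.random() < p:
                donor = rng.choice(donors)
                new_seq.append(rng.choice(sequences[donor]))
            else:
                new_seq.append(point)
        out[sid] = new_seq
    return out

# Made-up browsing sequences for three subjects.
data = {
    "#1": ["expedia.com", "kayak.com"],
    "#2": ["weather.com", "aol.com"],
    "#3": ["cnn.com"],
}

unchanged = perturb(data, p=0.0)  # nothing is replaced
swapped = perturb(data, p=1.0)    # every data point is replaced
noisy = perturb(data, p=0.3)      # the 30% setting from the slide
```

Sequence lengths are preserved, so only the content of a subject's history changes, not its volume.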
PAGE 14
Re-Identification Study - Perturbation
1.1% are re-identified despite 60% of data points being replaced*
→ Re-identification is robust against noise
*We replaced each data point, with 60% probability, with a data point from another subject. This is a highly destructive mechanism, applied in an attempt to prevent re-identification.
Idea: Release Synthetic Data rather than (perturbed) Original Data
● Generative AI can provide representative datasets that strive to retain statistical properties
● no 1:1 link between actual and synthetic subjects → thus no re-identification possible
● BUT a generative model might still leak information on individuals through memorization
PAGE 15
Re-Identification Study - Synthetization
How to Test Privacy of Synthetic Data?
● “Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data” (Platzer et al., forthcoming)
● Require that synthetic subjects are NOT systematically closer to training subjects than to holdout subjects
Empirical Privacy Test
● Split 4,000 subjects into 2,000 training and 2,000 holdout
● Generate 2,000 synthetic subjects based on 2,000 training
● Check whether synthetic subjects are any closer to training than to holdout based on TL-RNN embeddings
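The empirical privacy test can be sketched as follows, assuming that "closer" means the Euclidean distance to the nearest training or holdout embedding. That reading, the random 8-dimensional stand-in embeddings and the function name are assumptions; the slides do not spell out the distance definition.

```python
import numpy as np

def share_closer_to_training(synthetic, training, holdout):
    """Share of synthetic subjects whose nearest training embedding is
    closer than their nearest holdout embedding, plus both mean distances."""
    def nearest_dist(a, b):
        diffs = a[:, None, :] - b[None, :, :]  # all pairwise differences
        return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

    d_train = nearest_dist(synthetic, training)
    d_hold = nearest_dist(synthetic, holdout)
    share = float((d_train < d_hold).mean())
    return share, float(d_train.mean()), float(d_hold.mean())

# Stand-in embeddings (random, for illustration only).
rng = np.random.default_rng(0)
training = rng.normal(size=(200, 8))
holdout = rng.normal(size=(200, 8))

# A generator that only learned the distribution: share should be near 50%.
synthetic = rng.normal(size=(200, 8))
share_ok, _, _ = share_closer_to_training(synthetic, training, holdout)

# A generator that memorized its training data: share approaches 100%.
memorized = training + rng.normal(scale=0.01, size=training.shape)
share_leaky, _, _ = share_closer_to_training(memorized, training, holdout)
```

The holdout set acts as the reference: it comes from the same population but was never seen by the generator, so any systematic closeness to training data points to memorization.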
PAGE 16
Re-Identification Study - Synthetization
Results
● Avg Distance to Training: 0.731
● Avg Distance to Holdout: 0.737
● 51.8% are closer to training - 48.2% are closer to holdout
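One way to judge whether 51.8% differs systematically from an even split is a simple z-test on the share. The sketch below assumes that the share is based on the 2,000 synthetic subjects and that they can be treated as independent; neither assumption is confirmed by the slides, and the function name is hypothetical.

```python
import math

def z_test_share(share, n, p0=0.5):
    """Two-sided normal-approximation test of an observed share against p0."""
    se = math.sqrt(p0 * (1 - p0) / n)                # standard error under p0
    z = (share - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided tail probability
    return z, p_value

# Reported share closer to training, with an assumed sample size of 2,000.
z, p_value = z_test_share(0.518, 2000)
```

Under these assumptions z is about 1.6, which falls short of the conventional 5% significance threshold. This is consistent with the requirement that synthetic subjects are not systematically closer to training subjects.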
PAGE 17
Summary
● Sharing of behavioral data is subject to GDPR
● AI-based re-identification on behavioral traits is possible
● Data perturbation does not protect your privacy
● Data synthesis can offer true anonymization

More Related Content

What's hot

Orion Context Broker 1.15.0
Orion Context Broker 1.15.0Orion Context Broker 1.15.0
Orion Context Broker 1.15.0Fermin Galan
 
Drug Discovery Knowledge Graph
Drug Discovery Knowledge Graph Drug Discovery Knowledge Graph
Drug Discovery Knowledge Graph Tomás Sabat
 
Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUnity Technologies
 
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクス
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクスAmazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクス
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクスAmazon Web Services Japan
 
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...bitnineglobal
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureSplunk
 
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...Amazon Web Services Korea
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systemsDave Gardner
 
[AWS Builders] AWS상의 보안 위협 탐지 및 대응
[AWS Builders] AWS상의 보안 위협 탐지 및 대응[AWS Builders] AWS상의 보안 위협 탐지 및 대응
[AWS Builders] AWS상의 보안 위협 탐지 및 대응Amazon Web Services Korea
 
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon Pinpoint
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon PinpointAmazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon Pinpoint
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon PinpointAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2018 動画配信 on AWS
AWS Black Belt Online Seminar 2018 動画配信 on AWSAWS Black Belt Online Seminar 2018 動画配信 on AWS
AWS Black Belt Online Seminar 2018 動画配信 on AWSAmazon Web Services Japan
 
초보자를 위한 네트워크/VLAN 기초
초보자를 위한 네트워크/VLAN 기초초보자를 위한 네트워크/VLAN 기초
초보자를 위한 네트워크/VLAN 기초Open Source Consulting
 
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기Amazon Web Services Korea
 
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...AWS Korea 금융산업팀
 
Greenplum 6 Changes
Greenplum 6 ChangesGreenplum 6 Changes
Greenplum 6 ChangesVMware Tanzu
 
Airbyte - Series-A deck
Airbyte - Series-A deckAirbyte - Series-A deck
Airbyte - Series-A deckAirbyte
 
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례 Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례 bitnineglobal
 
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)Amazon Web Services Korea
 

What's hot (20)

Orion Context Broker 1.15.0
Orion Context Broker 1.15.0Orion Context Broker 1.15.0
Orion Context Broker 1.15.0
 
Drug Discovery Knowledge Graph
Drug Discovery Knowledge Graph Drug Discovery Knowledge Graph
Drug Discovery Knowledge Graph
 
Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model training
 
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクス
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクスAmazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクス
Amazon Game Tech Night #22 AWSで実現するデータレイクとアナリティクス
 
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...
Graph Database Meetup in Korea #8. Graph Database 5 Offerings_ DecisionTutor ...
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - Architecture
 
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...
삼성전자 5G Core CNF를 위한 클라우드 여정 이야기 - 최우형 AWS 솔루션즈 아키텍트 / 구동영 프로, 삼성전자 :: AWS Su...
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 
[AWS Builders] AWS상의 보안 위협 탐지 및 대응
[AWS Builders] AWS상의 보안 위협 탐지 및 대응[AWS Builders] AWS상의 보안 위협 탐지 및 대응
[AWS Builders] AWS상의 보안 위협 탐지 및 대응
 
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon Pinpoint
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon PinpointAmazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon Pinpoint
Amazon Pinpoint を中心としたカスタマーエンゲージメントの全体像 / Customer Engagement On Amazon Pinpoint
 
[4차]왓챠 알고리즘 분석(151106)
[4차]왓챠 알고리즘 분석(151106)[4차]왓챠 알고리즘 분석(151106)
[4차]왓챠 알고리즘 분석(151106)
 
AWS Black Belt Online Seminar 2018 動画配信 on AWS
AWS Black Belt Online Seminar 2018 動画配信 on AWSAWS Black Belt Online Seminar 2018 動画配信 on AWS
AWS Black Belt Online Seminar 2018 動画配信 on AWS
 
초보자를 위한 네트워크/VLAN 기초
초보자를 위한 네트워크/VLAN 기초초보자를 위한 네트워크/VLAN 기초
초보자를 위한 네트워크/VLAN 기초
 
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
 
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...
[보험사를 위한 AWS Data Analytics Day] 5_KB금융그룹과 계열사의 AWS 기ᄇ...
 
VPNaaS in Neutron
VPNaaS in NeutronVPNaaS in Neutron
VPNaaS in Neutron
 
Greenplum 6 Changes
Greenplum 6 ChangesGreenplum 6 Changes
Greenplum 6 Changes
 
Airbyte - Series-A deck
Airbyte - Series-A deckAirbyte - Series-A deck
Airbyte - Series-A deck
 
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례 Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례
Graph Database Meetup in Korea #4. 그래프 이론을 적용한 그래프 데이터베이스 활용 사례
 
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)
AWS Finance Symposium_국내 메이저 증권사의 클라우드 글로벌 로드밸런서 활용 사례 (gslb)
 

Similar to AI-based re-identification of behavioral data

Everything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic DataEverything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic DataMOSTLY AI
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltoolssuresh sood
 
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making DigitYser
 
KDD, Data Mining, Data Science_I.pptx
KDD, Data Mining, Data Science_I.pptxKDD, Data Mining, Data Science_I.pptx
KDD, Data Mining, Data Science_I.pptxYogeshGairola2
 
Sentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainSentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainEdureka!
 
Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Dan Elton
 
AI-Driven Logical Argumentation in Active Cyber Defense
AI-Driven Logical Argumentation in Active Cyber DefenseAI-Driven Logical Argumentation in Active Cyber Defense
AI-Driven Logical Argumentation in Active Cyber DefenseShawn Riley
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
Achieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logsAchieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logsIOSR Journals
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET Journal
 
Lecture-1-Introduction-to-Data-Mining.pdf
Lecture-1-Introduction-to-Data-Mining.pdfLecture-1-Introduction-to-Data-Mining.pdf
Lecture-1-Introduction-to-Data-Mining.pdfJojo314349
 
Internet of Things: Lightning Round, Hite
Internet of Things: Lightning Round, HiteInternet of Things: Lightning Round, Hite
Internet of Things: Lightning Round, HiteGovLoop
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQuantUniversity
 
Application To Monitor And Manage People In Crowded Places Using Neural Networks
Application To Monitor And Manage People In Crowded Places Using Neural NetworksApplication To Monitor And Manage People In Crowded Places Using Neural Networks
Application To Monitor And Manage People In Crowded Places Using Neural NetworksIJSRED
 
Fake News Detection using Passive Aggressive and Naïve Bayes
Fake News Detection using Passive Aggressive and Naïve BayesFake News Detection using Passive Aggressive and Naïve Bayes
Fake News Detection using Passive Aggressive and Naïve BayesIRJET Journal
 

Similar to AI-based re-identification of behavioral data (20)

Everything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic DataEverything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic Data
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
 
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making
DISUMMIT Keynote presentation from Kirk Borne - From Sensors to Sense-Making
 
Differential privacy and ml
Differential privacy and mlDifferential privacy and ml
Differential privacy and ml
 
KDD, Data Mining, Data Science_I.pptx
KDD, Data Mining, Data Science_I.pptxKDD, Data Mining, Data Science_I.pptx
KDD, Data Mining, Data Science_I.pptx
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
Sentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainSentiment Analysis In Retail Domain
Sentiment Analysis In Retail Domain
 
Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18
 
AI-Driven Logical Argumentation in Active Cyber Defense
AI-Driven Logical Argumentation in Active Cyber DefenseAI-Driven Logical Argumentation in Active Cyber Defense
AI-Driven Logical Argumentation in Active Cyber Defense
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
ProjectReport
ProjectReportProjectReport
ProjectReport
 
Achieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logsAchieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logs
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
 
Lecture-1-Introduction-to-Data-Mining.pdf
Lecture-1-Introduction-to-Data-Mining.pdfLecture-1-Introduction-to-Data-Mining.pdf
Lecture-1-Introduction-to-Data-Mining.pdf
 
Internet of Things: Lightning Round, Hite
Internet of Things: Lightning Round, HiteInternet of Things: Lightning Round, Hite
Internet of Things: Lightning Round, Hite
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in Finance
 
Application To Monitor And Manage People In Crowded Places Using Neural Networks
Application To Monitor And Manage People In Crowded Places Using Neural NetworksApplication To Monitor And Manage People In Crowded Places Using Neural Networks
Application To Monitor And Manage People In Crowded Places Using Neural Networks
 
Fake News Detection using Passive Aggressive and Naïve Bayes
Fake News Detection using Passive Aggressive and Naïve BayesFake News Detection using Passive Aggressive and Naïve Bayes
Fake News Detection using Passive Aggressive and Naïve Bayes
 

More from MOSTLY AI

Everything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic DataEverything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic DataMOSTLY AI
 
Synthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AISynthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AIMOSTLY AI
 
Nvidia GTC18 Platzer Töglhofer
Nvidia GTC18 Platzer TöglhoferNvidia GTC18 Platzer Töglhofer
Nvidia GTC18 Platzer TöglhoferMOSTLY AI
 
Artificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines LearnArtificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines LearnMOSTLY AI
 
PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016MOSTLY AI
 
My Entry to the DMEF CLV Contest
My Entry to the DMEF CLV ContestMy Entry to the DMEF CLV Contest
My Entry to the DMEF CLV ContestMOSTLY AI
 
Stochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer RelationshipsStochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer RelationshipsMOSTLY AI
 
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...MOSTLY AI
 

More from MOSTLY AI (8)

Everything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic DataEverything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic Data
 
Synthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AISynthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AI
 
Nvidia GTC18 Platzer Töglhofer
Nvidia GTC18 Platzer TöglhoferNvidia GTC18 Platzer Töglhofer
Nvidia GTC18 Platzer Töglhofer
 
Artificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines LearnArtificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines Learn
 
PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016
 
My Entry to the DMEF CLV Contest
My Entry to the DMEF CLV ContestMy Entry to the DMEF CLV Contest
My Entry to the DMEF CLV Contest
 
Stochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer RelationshipsStochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer Relationships
 
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
 

Recently uploaded

一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单vcaxypu
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...correoyaya
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatheahmadsaood
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportSatyamNeelmani2
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单nscud
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单ukgaet
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单ewymefz
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundOppotus
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhArpitMalhotra16
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 

Recently uploaded (20)

一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 

AI-based re-identification of behavioral data

  • 1. AI-based re-identification exposes privacy risk of behavioral data. A case for synthetic data Michael Platzer, MOSTLY AI Thomas Reutterer, Vienna University of Economics and Business Stefan Vamosi, Vienna University of Economics and Business May 2021 This work is supported by the “ICT of the Future” funding programme of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.
  • 2. The Re-Identification of Netflix Data. Netflix releases “anonymized” data on 2006-10-02: ● 470K users, 18K movies, 100M ratings ● only a subset of the customer base ● no customer information ● some random noise added to dates and ratings. Paper on re-identification published on 2006-10-18: ● fuzzy linkage attack ● leveraged public IMDB data as auxiliary data. Aftermath: 1. class action lawsuit against Netflix → undisclosed settlement 2. hardly any public sharing of behavioral data 3. privacy regulations adapted to linkage attacks
  • 3. GDPR (https://gdpr-info.eu/issues/personal-data/). Personal data are any information which are related to an identified or identifiable natural person. Recital 30: Natural persons may be associated with online identifiers provided by their devices [..] This may leave traces which [..] may be used to create profiles of the natural persons and identify them. Recital 26: To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
  • 4. AI-Based Re-Identification of Faces. [Figure: labeled faces (George Clooney, Arnold Schwarzenegger, Sylvester Stallone, ...) and an unidentified face marked “?”]
  • 5. Learning Traits of Faces with Triplet Loss. Train a deep neural network to discriminate triplets of faces: anchor (Subject A), positive sample (Subject A), negative sample (Subject B). This task will yield an embedding space representing the characteristic traits (Schroff et al., 2015), e.g. [-0.23, 0.39, 0.92, -0.02, 0.05, ..., -0.24]. → Re-identification is then done via nearest-neighbor search in that embedding space.
  • 6. Learning Traits of Behavior with TL-RNN. Train a deep neural network to discriminate triplets of behavioral data: anchor (Subject #123), positive sample (Subject #123), negative sample (Subject #789). This task will yield an embedding space representing the characteristic traits of users (Vamosi et al., forthcoming), e.g. [-0.23, 0.39, 0.92, -0.02, 0.05, ..., -0.24]. → Re-identification is then done via nearest-neighbor search in that embedding space.
  • 7. Re-Identification Study: Attack Scenario. 1. An organization releases a behavioral “anonymous” dataset D for period P1. 2. An attacker obtains auxiliary behavioral data on user X for period P2. 3. The attacker then attempts to re-identify user X within D via TL-RNN embeddings trained on D. → The attacker reveals the activities of user X within D. Linkage attacks rely on an overlap of the data points of the released and the auxiliary data; a fuzzy match on these data points allows for re-identification (e.g. “anonymous” released Netflix data matched against identified auxiliary IMDB data). Pattern attacks do not require an overlap of the data points, but merely of the data subjects of the released and the auxiliary data; a fuzzy match on the behavioral patterns of these data points allows for re-identification.
  • 8. Re-Identification Study: Dataset. ● Comscore Web Browser Panel = continuous tracking of browsing behavior ● Release January to June data for 4,000 active “anonymous” panelists ● Attempt to re-identify panelists based on their observed July data → no overlap in data points. Example: Subject #1 (Jan-Jun): expedia.com, kayak.com, google.com, kayak.com, ups.com, usatoday.com, ...; Subject #4000 (Jan-Jun): weather.com, google.com, usatoday.com, google.com, aol.com, google.com, ...; Subject #? (Jul): google.com, google.com, booking.com, kayak.com, cnn.com, weather.com, ... [Figure label: Arnold Schwarzenegger]
  • 9. Re-Identification Study: Research Questions. 1. Can we re-identify via pattern attack? 2. Can we protect with data perturbation? 3. Can we protect with data synthesis?
  • 10. Re-Identification Study. 0.025% (=1/4000) are re-identified via random guess.
  • 11. Re-Identification Study. 49.9% are re-identified via TL-RNN based pattern attack. → Re-identification on behavioral traits is possible.
  • 12. Re-Identification Study. 65.6% are within the top 5 candidates via TL-RNN based pattern attack. → Re-identification on behavioral traits is possible.
  • 13. Re-Identification Study: Perturbation. 26.6% are re-identified despite 30% of data points being replaced.* → Re-identification is robust against noise. *We replaced each data point, with 30% probability, with a data point from any other subject. This is a highly destructive mechanism in an attempt to prevent re-identification.
  • 14. Re-Identification Study: Perturbation. 1.1% are re-identified despite 60% of data points being replaced.* → Re-identification is robust against noise. *We replaced each data point, with 60% probability, with a data point from any other subject. This is a highly destructive mechanism in an attempt to prevent re-identification.
  • 15. Re-Identification Study: Synthetization. Idea: release synthetic data rather than (perturbed) original data. ● Generative AI can provide representative datasets that strive to retain statistical properties ● no 1:1 link between actual and synthetic subjects → thus no re-identification possible ● BUT one still might leak information on individuals by memorization. How to test the privacy of synthetic data? ● “Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data” (Platzer et al., forthcoming) ● Require that synthetic subjects are NOT systematically closer to training subjects than to holdout subjects
  • 16. Re-Identification Study: Synthetization. Empirical privacy test: ● Split 4,000 subjects into 2,000 training and 2,000 holdout ● Generate 2,000 synthetic subjects based on the 2,000 training subjects ● Check whether synthetic subjects are any closer to training than to holdout, based on TL-RNN embeddings. Results: ● Avg distance to training: 0.731 ● Avg distance to holdout: 0.737 ● 51.8% are closer to training, 48.2% are closer to holdout
  • 17. Summary ● Sharing of behavioral data is subject to GDPR ● AI-based re-identification on behavioral traits is possible ● Data perturbation does not protect your privacy ● Data synthesis can offer true anonymization
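
The triplet objective and the nearest-neighbor lookup described on slides 5 to 7 can be illustrated with a minimal Python sketch. This is not the TL-RNN of Vamosi et al.: the three-dimensional embedding vectors, the subject ids, the margin of 0.2, and the helper names `triplet_loss` and `rank_candidates` are all assumptions made for illustration. A real attack would take its embeddings from a trained network.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared distances, in the style of Schroff et al. (2015).

    The loss is zero once the anchor is closer to the positive sample
    (same subject) than to the negative sample (other subject) by at
    least `margin`. The margin value here is an illustrative assumption.
    """
    d_pos = euclidean(anchor, positive) ** 2
    d_neg = euclidean(anchor, negative) ** 2
    return max(0.0, d_pos - d_neg + margin)


def rank_candidates(query, released):
    """Sort released subject ids by distance to the query embedding."""
    return sorted(released, key=lambda sid: euclidean(query, released[sid]))


# Toy embeddings (made up): two "anonymous" subjects in released dataset D.
released = {
    "subject_123": [-0.23, 0.39, 0.92],
    "subject_789": [0.80, -0.10, 0.05],
}
# Embedding of the auxiliary data the attacker holds on user X.
query = [-0.20, 0.41, 0.88]

ranking = rank_candidates(query, released)
print("best match:", ranking[0])
```

The first entry of `ranking` is the attacker's re-identification guess. Keeping the first five entries corresponds to the "top 5 candidates" figure reported on slide 12.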
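
The perturbation mechanism from slides 13 and 14 (replace each data point, with probability p, by a data point from any other subject) can be sketched as follows. The function name `perturb`, the fixed seed, and the toy browsing sequences are assumptions. How the study sampled the donor subject and the replacement point is not stated on the slides, so uniform sampling is assumed here.

```python
import random


def perturb(sequences, p, seed=0):
    """Return a copy of `sequences` with each data point replaced, with
    probability `p`, by a random data point of a random other subject.

    `sequences` maps a subject id to a list of data points (e.g. domains).
    Uniform choice of donor subject and donor point is an assumption.
    """
    rng = random.Random(seed)
    subject_ids = list(sequences)
    perturbed = {}
    for sid in subject_ids:
        # Only subjects other than `sid` with at least one data point can donate.
        donors = [o for o in subject_ids if o != sid and sequences[o]]
        new_sequence = []
        for point in sequences[sid]:
            if donors and rng.random() < p:
                donor = rng.choice(donors)
                new_sequence.append(rng.choice(sequences[donor]))
            else:
                new_sequence.append(point)
        perturbed[sid] = new_sequence
    return perturbed


# Toy data, loosely modeled on the examples on slide 8.
panel = {
    "subject_1": ["expedia.com", "kayak.com", "google.com", "ups.com"],
    "subject_4000": ["weather.com", "usatoday.com", "aol.com", "cnn.com"],
}
noisy_panel = perturb(panel, p=0.3)
print(noisy_panel)
```

Sequence lengths are preserved, so only the content of a subject's history changes, not its volume.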
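
The empirical privacy test from slides 15 and 16 compares how close synthetic subjects are to training subjects versus holdout subjects. The sketch below assumes that "distance" means the Euclidean distance to the nearest subject in each set, which the slides do not specify. The function names and the two-dimensional toy embeddings are likewise hypothetical.

```python
import math


def nearest_distance(point, candidates):
    """Distance from `point` to its nearest neighbor among `candidates`."""
    return min(math.dist(point, c) for c in candidates)


def privacy_check(synthetic, training, holdout):
    """Compare each synthetic embedding's closeness to training vs. holdout.

    Returns the average nearest-neighbor distances and the share of
    synthetic subjects that are strictly closer to training (ties count
    as not closer). A share near 50% suggests the generator did not
    systematically memorize training subjects.
    """
    d_train = [nearest_distance(s, training) for s in synthetic]
    d_hold = [nearest_distance(s, holdout) for s in synthetic]
    n = len(synthetic)
    closer_to_training = sum(t < h for t, h in zip(d_train, d_hold))
    return {
        "avg_dist_training": sum(d_train) / n,
        "avg_dist_holdout": sum(d_hold) / n,
        "share_closer_to_training": closer_to_training / n,
    }


# Toy embeddings (made up) standing in for TL-RNN embeddings.
training = [(0.0, 0.0)]
holdout = [(10.0, 0.0)]
synthetic = [(1.0, 0.0), (9.0, 0.0)]

result = privacy_check(synthetic, training, holdout)
print(result)
```

On slide 16 the corresponding figures are 0.731 versus 0.737 for the average distances and 51.8% versus 48.2% for the shares, which is the near-even split this check looks for.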