Re-identification of Anonymized CDR Datasets Using Social Network Data
Alket Cecaj, Marco Mamei, Nicola Bicocchi
University of Modena and Reggio Emilia
PerCom 2014
More data... big opportunities for study
Dataset join and privacy issues
• Joining different datasets is the key to advanced forms of context awareness.
• Matching different user accounts associated with the same real person.
• Privacy issues: any kind of information can be inferred.
Related work
Anonymization... and re-identification
• Gender, ZIP code and full date of birth: 63% re-identification
• Movie ratings from the Netflix Prize dataset
• Massachusetts hospital medical records, re-identified using a voter list
• Re-identification of anonymous volunteers in a DNA study for the Personal Genome Project
In line with our domain
• Unique in the Crowd: The privacy bounds of human mobility
• Markov chain models for de-anonymization of geo-located data
Dataset join and privacy issues
• Can we use data from social networks to re-identify users in an anonymized dataset such as a CDR one?
• Probabilistic approach to evaluate the re-identification potential.
CDR and Social Datasets
CDR and Social Dataset - Distribution of events
● CDR
  ● on average 28 events/period, max = 330, min = 3
  ● 2.019321 users for final analysis
● Social dataset
  ● on average 20 events/period, max = 424, min = 3
  ● 700 users for final analysis
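As a rough illustration of how per-user event statistics like the ones above could be computed, here is a minimal Python sketch. The function name `event_stats` and the `min_events=3` cut-off are assumptions made for this example; the slides report min = 3 but do not say how users were selected for the final analysis.

```python
from collections import Counter
from statistics import mean


def event_stats(user_ids, min_events=3):
    """Count events per user and summarise users with enough events.

    user_ids: one user identifier per recorded event.
    min_events: hypothetical threshold (the slides only report min = 3).
    """
    counts = Counter(user_ids)
    # Keep only users with at least `min_events` events in the period.
    kept = {u: n for u, n in counts.items() if n >= min_events}
    if not kept:
        return kept, None
    summary = {
        "users": len(kept),
        "mean": mean(kept.values()),
        "max": max(kept.values()),
        "min": min(kept.values()),
    }
    return kept, summary


# Toy data: user "c" has too few events and is dropped.
sample_ids = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
kept, summary = event_stats(sample_ids)
```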
Matching users among datasets
● Time and space parameters for matching: for example, a 10-minute time interval between events, with the cell radius as the physical distance.
● A clone of the social dataset is used to check how many matches occur by chance, following Bonferroni's principle.
● Exclusion of CDR users who generate events at the same time but at a distance much greater than the cell radius.
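The matching steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Event` record, the function names, the 1 km radius, the 5 km exclusion distance and the time-shifted clone are all hypothetical choices made for this example. Only the 10-minute window comes from the slides.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Event:
    user: str
    t: float    # timestamp in seconds
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (Earth radius taken as 6371 km)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def match_counts(social, cdr, max_dt=600.0, radius_km=1.0, far_km=5.0):
    """For one social user's events, count matching events per CDR user.

    A CDR event matches a social event if it is within `max_dt` seconds
    and `radius_km` km. A CDR user with an event in the same time window
    but more than `far_km` km away is excluded altogether.
    """
    counts, excluded = {}, set()
    for s in social:
        matched = set()
        for e in cdr:
            if abs(e.t - s.t) > max_dt:
                continue
            d = haversine_km(s.lat, s.lon, e.lat, e.lon)
            if d <= radius_km:
                matched.add(e.user)
            elif d > far_km:
                excluded.add(e.user)
        for u in matched:
            counts[u] = counts.get(u, 0) + 1
    return {u: n for u, n in counts.items() if u not in excluded}


def shifted_clone(events, offset):
    """Assumed chance baseline: same places, timestamps shifted by `offset`."""
    return [Event(e.user, e.t + offset, e.lat, e.lon) for e in events]


MODENA, REGGIO = (44.6471, 10.9252), (44.6983, 10.6312)
social = [Event("s1", 0, *MODENA), Event("s1", 3600, *REGGIO)]
cdr = [
    Event("a", 100, *MODENA), Event("a", 3700, *REGGIO),  # follows s1
    Event("b", 200, *MODENA), Event("b", 3500, *MODENA),  # far away at t=3500
    Event("c", 5000, *REGGIO),                            # outside the time window
]
real = match_counts(social, cdr)
chance = match_counts(shifted_clone(social, 86400), cdr)
```

Here user `a` is the only remaining candidate with two matches, user `b` is excluded because it is in Modena while the social user is in Reggio Emilia, and the shifted clone produces no matches. On real data, the number of matches obtained by the clone gives an estimate of how many matches arise by chance.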
Convergence to one?
Distributions and Percentages
Probabilistic modelling
Given FTa, let $U$ be a discrete random variable taking $N_U$ values $U_i$, $i = 1, \dots, N_U$.
Overall results
Conclusions
Potential and/or limits of re-identification of users across
multiple mobility datasets.
Future research:
• the current model and overall approach need refinement
• privacy concerns, addressed through mechanisms for preserving privacy and data utility for a single aspect
• correlation among data sets represents a big opportunity to enrich the
information available to a pervasive application
Thank you for your attention.
Questions are welcome.
Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Re-identification of Anonymized CDR datasets Using Social network Data

  • 1. Re-identification of Anonymized CDR datasets Using Social network Data Alket Cecaj, Marco Mamei, Nicola Bicocchi University of studies of Modena and Reggio Emilia PerCom 2014
  • 3. Dataset join and privacy issues • Matching different users associated to the same real person. • Privacy issues: any kind of information can be inferred ● Joining different datasets is the key for advanced forms of context awareness
  • 4. Related work Anonymization.. and re-identification • Gender, ZIP and full date of birth: 63% re-identification • movie ratings from NetFlix Prize dataset • Medical records of Massachusetts Hospital using a voters list • re-identification of anonymous volunteers in a DNA study for Personal Genome Project In line with our domain • Unique in the Crowd: the privacy bounds of Human Mobility • Markov chain models for de-anonymization of geo-located data
  • 5. Dataset join and privacy issues. • Can we use data from social networks to re-identify users for an anonymized dataset such as a CDR one? • Probabilistic approach to evaluate the re-identification potential.
  • 6. CDR and Social Data sets
  • 7. CDR and Social Dataset - Distribution of events ● CDR ● on average 28 events/period , max = 330, min = 3 ● 2.019321 users for final analysis ● Social dataset ● on average 20 events/period , max = 424, min = 3 ● 700 users for final analysis
  • 8. Matching users among datasets ● Time and space parameters for matching, for example a 10 min time interval between events and the cell radius as physical distance ● Clone of the social dataset in order to check/verify the number of matchings made by chance, following Bonferroni’s principle. ● Exclusion of CDR users making events at the same time but at a distance much bigger than the cell radius.
  • 11. Probabilistic modelling Given $FT_a$, $U$ is a discrete random variable having $N_U$ values $U_i$, $i = 1 \dots N_U$
  • 13. Conclusions Potential and/or limits of re-identification of users across multiple mobility datasets. Future research: • the current model and overall approach need refinement • privacy concerns addressed through mechanisms for preserving privacy and data utility for a single aspect • correlation among data sets represents a big opportunity to enrich the information available to a pervasive application
  • 14. Thank you for your attention. Questions are welcome.

Editor's Notes

  1. My name is Alket Cecaj and I’m a PhD student at the University of studies of Modena and Reggio Emilia. In this work, done together with my supervisor Marco Mamei and with Nicola Bicocchi, we examine a large dataset of 335 million anonymized call records made by 3 million users over a period of 47 days in a region of northern Italy. By combining this dataset with publicly available data from social networks such as Twitter and Flickr, we present a probabilistic approach to evaluate the re-identification potential of the anonymized dataset.
  2. As mobile devices and Internet access become widespread, a vast quantity of data is generated. In particular, mobile telecom companies can monitor a large number of terminals as they connect to the network by collecting CDRs (Call Detail Records). There is also publicly available data from social networks such as Twitter or Flickr. These services collect geo-referenced data about their users and make it available through their REST APIs. This makes it possible to infer people's presence or actions in a given context and to study human and crowd behavior on a large scale.
  3. Obviously, having more data, or enriching existing data with other information, enables interesting applications. For example, it would be interesting to know whether user X in the CDR dataset is actually the same person as user Y in the Twitter data, and then join the two datasets. The matching process is straightforward: it consists of checking whether CDR user X and Twitter user Y consistently produced data at the same time and place; once enough geo-referenced elements overlap, we can be reasonably sure that the two users are actually the same. The dark side of the moon is that merging datasets can raise privacy issues, since relations between different types of data, in particular geo-referenced data, can be used to infer socio-economic status, mobility and shopping patterns, or even the user’s social graph. On the other hand, combining different datasets is a key enabler for advanced context awareness.
  4. The related work can be divided into two complementary parts: on one hand data anonymization (in particular the k-anonymity technique, which means making a person indistinguishable from at least k users) and on the other data re-identification. As anonymized data has become available to researchers, a considerable amount of work on re-identification has appeared. Among the early works: (1) census re-identification, where knowing gender, ZIP code and full date of birth allows 63% re-identification; (2) re-identification of users in the NetFlix Prize movie ratings dataset, which NetFlix released to improve its recommendation system and in which users were re-identified by relating their movie preferences or ratings to side information from IMDb; (3) medical records of Massachusetts Hospital re-identified using a voters list; (4) re-identification of anonymous volunteers in a DNA study for the Personal Genome Project. More similar to our work is Unique in the Crowd, which analyzes mobility traces from CDR data and in which the authors say that 4 geo-referenced points are enough to identify up to 90% of the CDR users.
  5. Our research purpose in this work was to experiment in this direction by asking the following question (bullet point 1) and subsequently evaluating the potential for re-identification.
  6. CDR data consists of records of events made by a mobile device (such as incoming/outgoing calls, text messages and data transmission for Internet connections), each with a timestamp and the coordinates of the cell tower handling the event. The social dataset is also made of records, each having an identifier (name or nickname), the description of a picture or tweet, coordinates and an event timestamp.
  7. In a) (left side) there is the distribution of events generated by the 3 million CDR users, with an average of about 28. On the right there is the distribution for Twitter/Flickr users. At the beginning we considered a pool of 810 users, from which we selected 700. Basically we excluded users who had generated too many or too few events.
  8. We use a combinatorial approach, trying to match (by time and space) every user from the first dataset with every user in the second dataset. For example, we had a match if the temporal distance between an event of user X from the Flickr/Twitter dataset and an event of user Y from the CDR dataset was less than 10 minutes, and their physical distance was less than the radius of the cell tower handling the CDR event of Y.
  9. Consider the social user FTa (in black) producing data at different moments t1, t2, t3 and t4 during a time interval (from left to right), and the CDR users C1, C2, C3 and C4: we can build the matchings shown in the figure. We can exclude C3, as this user produced data in the same time interval but at a distance d >> r, where r is the radius of the cell. Among C1, C2 and C4 the best candidate is C2, which has the best overlap; C1 and C4 are lacking some data, but we still cannot exclude them.
  10. This slide presents some statistics on the number of matchings we found and their distribution. On the left there is a boxplot summarizing the number of CDR users (for a better graphical representation the y axis is in logarithmic scale) having x matching events with FT users. On the right we have plotted the percentage of FT users that can be associated with x CDR users. Of course it is not possible to be completely sure about these users, and to deal with this kind of matching we use a probabilistic approach that is illustrated in the next slide.
  11. The probabilistic modelling tries to answer the question: given that the CDR user C2 has n events matching with FTa, how likely is it that the two users are the same? In other words, how likely is it that we actually de-anonymized the CDR user C2? We chose this approach not only because we had data from only one carrier, but also because the number of possible matchings (or matching events) is really high and in the end not all the CDR users can be excluded with respect to the social user i. So, given the user FTa (our social user), we consider a discrete random variable U having Nu values Ui (with i going from 1 to Nu) associated with the people that could be the user FTa. A subset of the values of U will therefore be associated with our CDR users. Theta_i is the probability that two users (one from each dataset) are the same person. We can then assume that the probability mass function associated with U can be modelled with a Dirichlet distribution in which each alpha_i is set equal to 1/Nu. So if our social user matches with 10 CDR users, each of them has the same prior probability, 1/10. If a CDR user falls under the exclusion condition illustrated in slide 9, we set alpha_i = 0. Then we count the number of times each CDR user produces events matching the events of the social user, denoted M, and following Bayes' rule we update the posterior probability as the conditional probability of theta given M. In the end there will be a single most probable hypothesis, the Maximum a Posteriori estimate theta_i MAP.
  12. Considering only users having more than one match for each FT user, we compute the probability of matching a CDR user. Figure a) (left side) illustrates the results for a CDR-FT re-identification: it shows that the CDR user “0de7f” has a high probability and a large gap with respect to the other CDR users. Even though we do not have ground-truth evidence, this large gap suggests that the social user 1278644 is the same person as the CDR user with whom it has such a large probability. Figure b) shows the overall results: for each social user we compute the probability of the top matching CDR user, and then we count the number of CDR users that are re-identified with a given probability, in this case with probability larger than 0.1. We re-identified 260 social users, which is about one third of the social dataset we considered.
  13. The model is based on a number of independence assumptions that can hardly be justified in the real world. Also, the random variables being used tend to have a large number of possible outcomes, so the overall probability distribution remains low even after a large number of matching events. Privacy concerns are the main factor preventing CDR data from being used in pervasive applications, but we believe that a viable approach could be a mechanism of differential anonymization that preserves privacy without destroying the utility of the dataset for a single aspect, namely the one useful for the specific application. Correlation among datasets represents a big opportunity to enrich the information available to a pervasive application and to achieve the pervasive computing vision.
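
The matching and exclusion rules described in notes 8 and 9 can be sketched roughly as follows. This is only an illustration, not the authors' code: the `Event` type, the `compare_users` function, the use of a single `cell_radius_m` for all CDR events (the notes use the radius of the cell handling each event) and the `FAR_FACTOR` threshold standing in for "d >> r" are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

MAX_DT_S = 10 * 60   # 10-minute time window mentioned in the notes
FAR_FACTOR = 10.0    # hypothetical threshold standing in for "d >> r"


@dataclass(frozen=True)
class Event:
    t: float    # Unix timestamp in seconds
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees


def haversine_m(a: Event, b: Event) -> float:
    """Great-circle distance between two events, in metres."""
    earth_radius_m = 6_371_000.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(h))


def compare_users(social_events, cdr_events, cell_radius_m):
    """Return (n_matches, excluded) for one social user and one CDR user.

    n_matches counts social events that have at least one CDR event within
    the time window and within the cell radius. excluded is True if some
    CDR event falls in the time window but much farther away than the radius.
    """
    n_matches = 0
    excluded = False
    for s in social_events:
        matched = False
        for c in cdr_events:
            if abs(s.t - c.t) >= MAX_DT_S:
                continue  # not close enough in time to say anything
            d = haversine_m(s, c)
            if d < cell_radius_m:
                matched = True
            elif d > FAR_FACTOR * cell_radius_m:
                excluded = True
        n_matches += matched
    return n_matches, excluded
```

Comparing every social user against every CDR user with this function is quadratic in the number of events, so on a dataset of the size described in note 1 one would presumably index events by time first; that optimisation is left out here.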
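
The Bayesian update described in note 11 can be illustrated with a small sketch. The notes do not spell out the exact formula, so this is an assumption: it treats the match counts M as multinomial observations under a Dirichlet prior with alpha_i = 1/Nu (alpha_i = 0 for excluded users) and uses the posterior mean as the point estimate, because the usual closed-form Dirichlet mode requires all parameters to be above one. The function names and the example counts are hypothetical.

```python
def posterior_over_candidates(match_counts, n_candidates, excluded=frozenset()):
    """Posterior mean of theta_i for each observed CDR candidate.

    match_counts: dict mapping CDR user id -> number of matching events M_i
    n_candidates: Nu, the number of people who could be the social user
    excluded:     ids that met the exclusion condition (alpha_i = 0)
    """
    if n_candidates < len(match_counts):
        raise ValueError("n_candidates must cover all observed candidates")
    alpha = 1.0 / n_candidates                       # uniform prior pseudo-count
    n_active = n_candidates - len(excluded)          # candidates with alpha > 0
    total = sum(m for u, m in match_counts.items() if u not in excluded)
    denom = alpha * n_active + total
    if denom <= 0:
        raise ValueError("no candidate left with non-zero probability")
    return {
        u: 0.0 if u in excluded else (alpha + m) / denom
        for u, m in match_counts.items()
    }


def most_probable(posterior):
    """Id of the candidate with the highest posterior value."""
    return max(posterior, key=posterior.get)


# Hypothetical counts for the four CDR users of note 9; C3 is excluded.
counts = {"C1": 2, "C2": 4, "C3": 3, "C4": 2}
post = posterior_over_candidates(counts, n_candidates=4, excluded={"C3"})
best = most_probable(post)
```

When `n_candidates` is larger than the number of observed CDR users, the returned values sum to less than one; the remainder is the mass left on people who never appear in the CDR data.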