Re-identification of Anonymized CDR Datasets Using Social Network Data

Alket Cecaj

Re-identification of Anonymized CDR
Datasets Using Social Network Data
Alket Cecaj, Marco Mamei, Nicola Bicocchi
University of Modena and Reggio Emilia
PerCom 2014

Dataset join and privacy issues
• Matching different users associated with the same real
person.
• Privacy issues: any kind of information can be inferred
• Joining different datasets is the key to advanced forms of
context awareness

Related work
Anonymization...
and re-identification
• Gender, ZIP code and full date of birth: 63% re-identification
• Movie ratings from the Netflix Prize dataset
• Medical records of Massachusetts Hospital, using a voter list
• Re-identification of anonymous volunteers in a DNA study for the Personal
Genome Project
In line with our domain
• Unique in the Crowd: the privacy bounds of human mobility
• Markov chain models for de-anonymization of geo-located data

Dataset join and privacy issues
• Can we use data from social networks to re-identify users in an
anonymized dataset such as a CDR one?
• Probabilistic approach to evaluate the re-identification potential.

CDR and Social Dataset - Distribution of events
● CDR
  ● on average 28 events/period, max = 330, min = 3
  ● 2.019321 users for final analysis
● Social dataset
  ● on average 20 events/period, max = 424, min = 3
  ● 700 users for final analysis

Matching users among datasets
● Time and space parameters for matching: for example, a 10-minute time
interval between events, and the cell radius as physical distance
● Clone of the social dataset, used to check how many matchings
occurred by chance, following Bonferroni's principle.
● Exclusion of CDR users producing events at the same time but at a
distance much bigger than the cell radius.

Probabilistic modelling
Given $FT_a$, let $U$ be a discrete random variable with $N_U$ values $U_i$, $i = 1, \dots, N_U$.

Conclusions
Potential and limits of re-identification of users across
multiple mobility datasets.
Future research:
• the current model and overall approach need refinement
• privacy concerns: mechanisms for preserving privacy while keeping
data utility for a single aspect
• correlation among datasets represents a big opportunity to enrich the
information available to a pervasive application

Thank you for your attention.
Questions are welcome.

Additional slide titles (numbered as in the original deck):
2. More data, big opportunities for study
6. CDR and social datasets
9. Convergence to one?
10. Distributions and percentages
12. Overall results

Editor's Notes

My name is Alket Cecaj and I'm a PhD student at the University of Modena and Reggio Emilia. In this work, done together with my supervisor Marco Mamei and with Nicola Bicocchi, we examine a large dataset of 335 million anonymized call records made by 3 million users over a period of 47 days in a region of northern Italy. By combining this dataset with publicly available data from social networks such as Twitter and Flickr, we present a probabilistic approach to evaluate the re-identification potential of the anonymized dataset.
As mobile devices and the Internet become widely available, a vast quantity of data is generated. In particular, mobile telecom companies can monitor a large number of terminals as they connect to the network by collecting CDRs (Call Detail Records). There is also publicly available data from social networks such as Twitter or Flickr. These services collect geo-referenced data about their users and make it available through their REST APIs. This makes it possible to infer people's presence or actions in a given context and to study human and crowd behavior at large scale.
Obviously, having more data, or enriching existing data with other information, enables interesting applications. For example, it would be interesting to know whether user X in the CDR dataset is actually the same person as user Y in the Twitter data, and then join the two datasets. The matching process is straightforward: it consists of checking whether CDR user X and Twitter user Y consistently produced data at the same time and place; once enough geo-referenced elements overlap, we can be reasonably sure that the two users are the same. The dark side of the moon is that merging datasets can raise privacy issues, since relations between different types of data, geo-referenced data in particular, can be used to infer socio-economic status, mobility and shopping patterns, or even a user's social graph. On the other hand, combining different datasets is a key enabler for advanced context awareness.
The related work can be divided into two complementary parts: on one hand data anonymization (in particular the k-anonymity technique, which makes a person indistinguishable from at least k-1 other users), and on the other data re-identification. As anonymized data becomes available to researchers, there is a considerable body of work on re-identification. Some early examples: (1) census re-identification, where knowing gender, ZIP code and full date of birth allows 63% re-identification; (2) re-identification of users in the Netflix Prize movie-ratings dataset, which Netflix released to improve its recommendation system, where users were re-identified by relating their movie preferences or ratings to side information from IMDb; (3) medical records of Massachusetts Hospital, re-identified using a voter list; (4) re-identification of anonymous volunteers in a DNA study for the Personal Genome Project. Closer to our work is Unique in the Crowd, which analyzes mobility traces from CDR data and whose authors say that 4 geo-referenced points are enough to identify up to 90% of the CDR users.
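The k-anonymity notion mentioned above can be made concrete with a small check. This is a minimal sketch, not taken from the talk: the function name `k_anonymity`, the field names and the four records are all made up for illustration. It groups records by their quasi-identifier values and reports the size of the smallest group.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Made-up records, for illustration only.
records = [
    {"gender": "F", "zip": "41121", "dob": "1980-02-03"},
    {"gender": "F", "zip": "41122", "dob": "1980-07-19"},
    {"gender": "M", "zip": "41121", "dob": "1975-11-30"},
    {"gender": "M", "zip": "41125", "dob": "1975-01-08"},
]

# Generalize: keep 4 digits of the ZIP code and only the year of birth.
generalized = [{**r, "zip": r["zip"][:4] + "*", "dob": r["dob"][:4]} for r in records]

qi = ["gender", "zip", "dob"]
k_raw = k_anonymity(records, qi)
k_generalized = k_anonymity(generalized, qi)
```

With full ZIP code and date of birth every record is unique; after generalization each record shares its quasi-identifiers with one other record.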
Our research purpose in this work was to experiment in this direction by asking the following question (bullet point 1) and then evaluating the potential for re-identification.
CDR data consists of records, or events, generated by a mobile device (incoming/outgoing calls, text messages, and data transmission for Internet connections), each with a timestamp and the coordinates of the cell tower handling the event. The social dataset is also made of records, each with an identifier (name or nickname), the description of the picture or tweet, coordinates, and the event timestamp.
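The two record types described above could be represented as follows. This is only a sketch: the class names, field names, identifiers and sample values are hypothetical and are not the schema of the actual datasets.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdrEvent:
    user_id: str         # anonymized identifier (hypothetical field name)
    kind: str            # e.g. "call", "sms", "data"
    timestamp: datetime
    tower_lat: float     # coordinates of the cell tower handling the event
    tower_lon: float


@dataclass(frozen=True)
class SocialEvent:
    user_id: str         # name or nickname
    text: str            # description of the picture or tweet
    timestamp: datetime
    lat: float
    lon: float


# Made-up sample events, for illustration only.
cdr = CdrEvent("cdr_user_1", "call", datetime(2013, 3, 1, 10, 0), 44.650, 10.920)
post = SocialEvent("social_user_a", "city centre", datetime(2013, 3, 1, 10, 4), 44.646, 10.925)
gap_seconds = (post.timestamp - cdr.timestamp).total_seconds()
```

Both record types carry a timestamp and a position, which is all the matching step needs.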
In a) (left side) there is the distribution of events generated by the 3 million CDR users, with an average of about 28. On the right there is the distribution for the Twitter/Flickr users. At the beginning we considered a pool of 810 users, from which we kept 700. Basically, we excluded users who had produced too many or too few events.
We use a combinatorial approach, trying to match (by time and space) every user from the first dataset with every user in the second dataset. For example, we have a match if the temporal distance between an event of user X from the Flickr/Twitter dataset and an event of user Y from the CDR dataset is less than 10 minutes, and their physical distance is less than the radius of the cell tower handling the CDR event of Y.
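The matching rule just described can be sketched as a predicate over pairs of events. The tuple layouts, the function names and the use of the haversine formula for physical distance are assumptions made for this illustration; the talk does not say how distances were computed.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def events_match(social, cdr, max_gap=timedelta(minutes=10)):
    """social: (timestamp, lat, lon); cdr: (timestamp, tower_lat, tower_lon, cell_radius_m)."""
    s_time, s_lat, s_lon = social
    c_time, c_lat, c_lon, radius_m = cdr
    close_in_time = abs(s_time - c_time) < max_gap
    close_in_space = haversine_m(s_lat, s_lon, c_lat, c_lon) < radius_m
    return close_in_time and close_in_space


def count_matches(social_events, cdr_events):
    """Number of social events matching at least one CDR event of a single CDR user."""
    return sum(any(events_match(s, c) for c in cdr_events) for s in social_events)


# Made-up events: about 600 m apart, 4 minutes apart, cell radius 1000 m.
t0 = datetime(2013, 3, 1, 10, 0)
social_event = (t0 + timedelta(minutes=4), 44.646, 10.925)
cdr_event = (t0, 44.650, 10.920, 1000.0)
```

Applying `count_matches` to every pair of users reproduces the combinatorial search; on millions of users a spatial or temporal index would be needed, which this sketch leaves out.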
Consider the social user FTa (in black) producing data at different moments t1, t2, t3 and t4 during a time interval (from left to right), and the CDR users C1, C2, C3 and C4; we can build the matchings shown in the figure. We can exclude C3, as this user produced data in the same time interval but at a distance d >> r, where r is the radius of the cell. Among C1, C2 and C4 the best candidate is C2, which has the best overlap, while C1 and C4 are lacking some data but still cannot be excluded.
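The exclusion of C3 and the ranking of the remaining candidates can be sketched as below. The input layout and the `far_factor` threshold that stands in for "d >> r" are assumptions for illustration; the talk does not give a numeric cut-off.

```python
from datetime import timedelta


def classify_candidate(pairs, max_gap=timedelta(minutes=10), far_factor=5.0):
    """pairs: iterable of (time_gap, distance_m, cell_radius_m), one entry per
    social-event/CDR-event pair. Returns (excluded, n_matches)."""
    n_matches = 0
    for gap, dist, radius in pairs:
        if abs(gap) >= max_gap:
            continue  # not co-temporal: no evidence either way
        if dist > far_factor * radius:
            return True, 0  # same time but far away: cannot be the same person
        if dist < radius:
            n_matches += 1
    return False, n_matches


# Made-up pairs mirroring the figure: C2 overlaps on all four events,
# C1 on one only, C3 is co-temporal but 20 km away.
m = timedelta(minutes=1)
c2 = [(2 * m, 200, 1000), (5 * m, 300, 1000), (1 * m, 100, 1000), (3 * m, 400, 1000)]
c1 = [(2 * m, 200, 1000), (timedelta(hours=3), 50, 1000)]
c3 = [(2 * m, 20_000, 1000)]
```

A candidate that is neither excluded nor fully overlapping, like C1 here, stays in the pool with a lower match count.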
This slide presents some statistics on the number of matchings we found and their distribution. On the left there is a boxplot summarizing the number of CDR users having x matching events with FT users (for a better graphical representation the y axis is in logarithmic scale). On the right we have plotted the percentage of FT users that can be associated with x CDR users. Of course it is not possible to be completely sure about these users, and to deal with these kinds of matchings we use a probabilistic approach that will be illustrated in the next slide.
The probabilistic model tries to answer the question: given that the CDR user C2 has n events matching with FTa, how likely is it that the two users are the same? In other words, how likely is it that we actually de-anonymized the CDR user C2? We chose this approach not only because we had data from only one carrier, but also because the number of possible matchings (or matching events) is really high and, in the end, not all the CDR users can be excluded with respect to the social user. So, given the user FTa (our social user), we consider a discrete random variable U having N_U values U_i (with i going from 1 to N_U) associated with the people that could be the user FTa. A subset of U will thus also be associated with our CDR users. theta_i is the probability that two users (one from each dataset) are the same person. We then assume that the probability mass function associated with U can be modelled with a Dirichlet distribution in which each alpha_i is set to 1/N_U. So if our social user matches 10 CDR users, each of them has the same prior probability, 1/10. If a CDR user falls under the exclusion condition illustrated in the previous slide, we set alpha_i = 0. Then we count the number of times each CDR user produces events matching the events of the social user, M, and, following Bayes' rule, update the posterior probability as the conditional probability of theta given M. At the end there will be a single most probable hypothesis, the maximum a posteriori (MAP) theta_i.
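The update described above can be sketched as a Dirichlet prior combined with match counts. Several choices here are assumptions rather than the authors' implementation: the function names, taking N_U as the number of candidates when it is not given, and scoring each candidate by the posterior mean of theta_i instead of the mode, because the Dirichlet mode is not defined when parameters fall below 1.

```python
def posterior_probabilities(match_counts, excluded=(), n_u=None):
    """match_counts: {cdr_user_id: number of events matching the social user}.
    excluded: CDR users hitting the exclusion condition (alpha_i = 0).
    n_u: number of people who could be the social user (defaults to the
    number of candidates). Returns {cdr_user_id: posterior mean of theta_i}."""
    excluded = set(excluded)
    if n_u is None:
        n_u = len(match_counts)
    alpha = 1.0 / n_u
    live = {u: m for u, m in match_counts.items() if u not in excluded}
    # People outside match_counts have zero matches and keep only their prior mass.
    n_unobserved = n_u - len(match_counts)
    total = sum(alpha + m for m in live.values()) + alpha * n_unobserved
    return {u: (alpha + m) / total for u, m in live.items()}


def most_probable(match_counts, excluded=(), n_u=None):
    """Candidate with the highest posterior probability, and that probability."""
    post = posterior_probabilities(match_counts, excluded, n_u)
    best = max(post, key=post.get)
    return best, post[best]


# Made-up counts following the C1..C4 example; C3 is excluded.
counts = {"C1": 2, "C2": 4, "C3": 3, "C4": 1}
post = posterior_probabilities(counts, excluded={"C3"})
best, p_best = most_probable(counts, excluded={"C3"})

# Ten candidates and no matching events yet: each keeps the prior 1/10.
flat = posterior_probabilities({f"C{i}": 0 for i in range(10)})
```

Passing a larger `n_u` spreads prior mass over people who do not appear in the CDR data, which lowers every candidate's probability.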
Having considered only users with more than one match for each FT user, we compute the probability of matching a CDR user. Figure a), on the left, illustrates the results for one CDR-FT re-identification: the CDR user "0de7f" has a high probability and a large gap to the other CDR users. Even though we don't have ground-truth evidence, this large gap suggests that the social user 1278644 is the same person as that CDR user. Figure b) shows the overall results: for each social user we compute the probability of the top-matching CDR user, and then we count the number of CDR users that are re-identified with a given probability, in this case larger than 0.1. We re-identified 260 social users, which is about one third of the social dataset we considered.
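The counting step behind the overall results can be sketched as follows. The function name, the input layout and the sample probabilities are made up for illustration; only the 0.1 threshold comes from the text.

```python
def count_reidentified(top_probabilities, threshold=0.1):
    """top_probabilities: {social_user_id: probability of its top-matching CDR user}.
    Returns (number of users above the threshold, fraction of all users)."""
    hits = [u for u, p in top_probabilities.items() if p > threshold]
    return len(hits), len(hits) / len(top_probabilities)


# Made-up probabilities, for illustration only.
n_hits, share = count_reidentified({"a": 0.55, "b": 0.08, "c": 0.12})
```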
The model is based on a number of independence assumptions that can hardly be justified in the real world. Also, the random variables being used tend to have a large number of possible outcomes, so the overall probability remains low even after a large number of matching events. Privacy concerns are the main factor preventing CDR data from being used in pervasive applications, but we believe that a viable approach could be a mechanism of differential anonymization that preserves privacy without destroying the utility of the dataset for the single aspect that is useful to the specific application. Correlation among datasets represents a big opportunity to enrich the information available to a pervasive application, towards the achievement of the pervasive computing vision.
