Re-identification of Anonymized CDR Datasets Using Social Network Data
1. Re-identification of Anonymized CDR
Datasets Using Social Network Data
Alket Cecaj, Marco Mamei, Nicola Bicocchi
University of Modena and Reggio Emilia
PerCom 2014
3. Dataset join and privacy issues
• Matching different users associated with the same real
person.
• Privacy issues: any kind of information can be inferred
• Joining different datasets is key to advanced forms of
context awareness
4. Related work
Anonymization
and re-identification
• Gender, ZIP code and full date of birth: 63% re-identification
• Movie ratings from the Netflix Prize dataset
• Medical records of Massachusetts Hospital, using a voter list
• Re-identification of anonymous volunteers in a DNA study for the Personal
Genome Project
In line with our domain
• Unique in the Crowd: the privacy bounds of Human Mobility
• Markov chain models for de-anonymization of geo-located data
5. Dataset join and privacy issues.
• Can we use data from social networks to
re-identify users in an anonymized dataset
such as a CDR one?
• Probabilistic approach to evaluate the
re-identification potential.
7. CDR and Social Dataset - Distribution of events
● CDR
● on average 28 events/period, max = 330, min = 3
● 2,019,321 users for final analysis
● Social dataset
● on average 20 events/period, max = 424, min = 3
● 700 users for final analysis
8. Matching users among datasets
● Time and space parameters for matching: for example, a 10-minute time
interval between events and the cell radius as physical distance
● Clone of the social dataset, used to check how many matchings
occurred by chance, following Bonferroni's principle.
● Exclusion of CDR users generating events at the same time but at a
distance much larger than the cell radius.
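The chance-match check in the second bullet can be sketched as follows. The slides do not say how the clone was built, so this minimal sketch assumes a time-shifted copy of the social dataset: shifting every timestamp breaks any real co-occurrence, so matches that survive in the clone can only be coincidences. The function names, the planar (x, y) coordinates in metres and the toy events are illustrative assumptions, not the authors' implementation.

```python
def count_matches(social, cdr, cell_radius_m, max_dt_s=600.0):
    """Count (social event, CDR event) pairs that are close in time and space.

    Events are (t_seconds, x_m, y_m) tuples in a local planar frame
    (an assumption made to keep the sketch short).
    """
    n = 0
    for ts, xs, ys in social:
        for tc, xc, yc in cdr:
            dist = ((xs - xc) ** 2 + (ys - yc) ** 2) ** 0.5
            if abs(ts - tc) < max_dt_s and dist < cell_radius_m:
                n += 1
    return n


def time_shifted_clone(social, shift_s):
    """Hypothetical clone: same places, timestamps shifted by shift_s seconds."""
    return [(t + shift_s, x, y) for t, x, y in social]


# Toy data, invented for illustration only.
social = [(0.0, 0.0, 0.0), (3600.0, 500.0, 0.0), (7200.0, 5000.0, 0.0)]
cdr = [(120.0, 100.0, 50.0), (3700.0, 600.0, 100.0), (50000.0, 5000.0, 0.0)]

real = count_matches(social, cdr, cell_radius_m=1000.0)
chance = count_matches(time_shifted_clone(social, 1800.0), cdr, cell_radius_m=1000.0)
print(real, chance)
```

If the clone produces roughly as many matches as the real dataset, the matches carry little evidence; a large difference suggests they are not due to chance alone.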
13. Conclusions
Potential and/or limits of re-identification of users across
multiple mobility datasets.
Future research:
• the current model and overall approach need refinement
• address privacy concerns through mechanisms that preserve privacy and
data utility for a single aspect
• correlation among data sets represents a big opportunity to enrich the
information available to a pervasive application
14. Thank you for your attention.
Questions are welcome.
Editor's Notes
My name is Alket Cecaj and I am a PhD student at the University of Modena and Reggio Emilia. In this work, done together with my supervisor Marco Mamei and with Nicola Bicocchi, we examine a large dataset of 335 million anonymized call records made by 3 million users over a period of 47 days in a region of northern Italy. By combining this dataset with publicly available data from social networks such as Twitter and Flickr, we present a probabilistic approach for evaluating the re-identification potential of the anonymized dataset.
As mobile devices and Internet access become widespread, a vast quantity of data is generated. In particular, mobile telecom companies can monitor a large number of terminals as they connect to the network by collecting CDRs (Call Detail Records). There is also publicly available data from social networks such as Twitter and Flickr. These services collect geo-referenced data about their users and make it available through their REST APIs. This makes it possible to infer people's presence or actions in a given context and to study human and crowd behavior on a large scale.
Obviously, having more data, or enriching existing data with other information, enables interesting applications. For example, it would be interesting to know whether user X in the CDR dataset is actually the same person as user Y in the Twitter data, and then join the two datasets. The matching process is straightforward: it consists of checking whether CDR user X and Twitter user Y consistently produced data at the same time and place; once enough geo-referenced elements overlap, we can be reasonably sure that the two users are actually the same. The dark side is that merging datasets can raise privacy issues, since relations between different types of data, geo-referenced data in particular, can be used to infer socio-economic status, mobility and shopping patterns, or even a user's social graph. On the other hand, combining different datasets is a key enabler for advanced context awareness.
The related work can be divided into two complementary parts: on one hand data anonymization (in particular the k-anonymity technique, which makes a person indistinguishable from at least k-1 other users), and on the other data re-identification.
Since anonymized data is available to researchers, there is a considerable body of work on data re-identification. Some early examples:
1- Census re-identification: knowing gender, ZIP code and full date of birth allows for 63% re-identification.
2- Re-identification of users in the Netflix Prize movie ratings dataset, which Netflix released to improve its recommendation system; users were re-identified by relating their movie preferences or ratings to side information from IMDb.
3- Medical records of Massachusetts Hospital, re-identified using a voter list.
4- Re-identification of anonymous volunteers in a DNA study for the Personal Genome Project.
More similar to our work is "Unique in the Crowd", which analyzes mobility traces from CDR data and whose authors say that 4 geo-referenced points are enough to identify up to 90% of the CDR users.
So the purpose of our research was to experiment in this direction by asking the following question (bullet point 1), and subsequently to evaluate the potential for re-identification.
CDR data consists of records of events made by a mobile device (such as incoming/outgoing calls, text messages and data transmission for Internet connections), each with a timestamp and the coordinates of the cell tower handling the event. The social dataset is likewise made of records with an identifier (name or nickname), the description of the picture or tweet, coordinates and an event timestamp.
In a) (left side) there is the distribution of events generated by the 3 million CDR users, with an average of about 28. On the right there is the distribution for Twitter/Flickr users. At the beginning we considered a pool of 810 users, from which we chose 700. Basically, we excluded users who had generated too many or too few events.
We use a combinatorial approach, trying to match (by time and space) every user from the first dataset with every user in the second dataset. For example, we had a match if the temporal distance between an event of user X from the Flickr/Twitter dataset and an event of user Y from the CDR dataset was less than 10 minutes, and their physical distance was less than the radius of the cell tower handling Y's CDR event.
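The matching rule just described can be written as a small predicate. This is a minimal sketch, not the authors' code: the `Event` record, the haversine great-circle distance and the sample coordinates are assumptions made for illustration, and the cell radius is passed in because the notes do not give a value.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: who, when (Unix seconds), where (degrees)."""
    user: str
    t: float
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_match(social, cdr, cell_radius_m, max_dt_s=600.0):
    """True if the two events are less than 10 minutes apart (by default)
    and closer than the radius of the cell handling the CDR event."""
    close_in_time = abs(social.t - cdr.t) < max_dt_s
    close_in_space = haversine_m(social.lat, social.lon, cdr.lat, cdr.lon) < cell_radius_m
    return close_in_time and close_in_space


# Invented sample events, about 400 m and 5 minutes apart.
tweet = Event("ft_a", 1_000.0, 44.6471, 10.9252)
call = Event("c2", 1_300.0, 44.6471, 10.9302)
print(is_match(tweet, call, cell_radius_m=1000.0))
```

Applying the predicate to every pair of users is quadratic in the number of events; on a dataset of this size, some indexing by time window or by cell would probably be needed in practice.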
Consider the social user FTa (in black) producing data at different moments t1, t2, t3 and t4 (from left to right), and the CDR users C1, C2, C3 and C4; we can build the matchings shown in the figure. We can exclude C3, as this user produced data in the same time interval but at a distance d >> r, where r is the radius of the cell. Among C1, C2 and C4, the best candidate is C2, which has the best overlap, while C1 and C4 are lacking some data but still cannot be excluded.
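The exclusion and ranking logic of this example can be sketched as below. The notes only say d >> r, so the `far_factor` threshold (here 10 cell radii) is an assumption, as are the function name and the toy pairs, which loosely mirror the C1 to C4 scenario.

```python
from collections import Counter


def score_candidates(pairs, cell_radius_m, max_dt_s=600.0, far_factor=10.0):
    """Count matching events per CDR user and collect excluded users.

    pairs: iterable of (cdr_user, dt_s, dist_m), one entry per
    (social event, CDR event) pair for a single social user.
    far_factor is a hypothetical stand-in for the "d >> r" condition.
    """
    matches = Counter()
    excluded = set()
    for cdr_user, dt_s, dist_m in pairs:
        if abs(dt_s) >= max_dt_s:
            continue  # not simultaneous: no evidence either way
        if dist_m > far_factor * cell_radius_m:
            excluded.add(cdr_user)  # same time, but far away
        elif dist_m < cell_radius_m:
            matches[cdr_user] += 1
    for user in excluded:
        matches.pop(user, None)  # earlier matches of an excluded user are dropped
    return dict(matches), excluded


# Toy pairs (dt in seconds, distance in metres), invented for illustration.
pairs = [
    ("C1", 60.0, 200.0), ("C1", -120.0, 400.0), ("C1", 7200.0, 50_000.0),
    ("C2", 30.0, 100.0), ("C2", 90.0, 150.0), ("C2", -45.0, 300.0), ("C2", 10.0, 250.0),
    ("C3", 20.0, 300.0), ("C3", 15.0, 50_000.0),
    ("C4", 200.0, 500.0), ("C4", -300.0, 700.0),
]
matches, excluded = score_candidates(pairs, cell_radius_m=1000.0)
print(matches, excluded)
```

A single simultaneous far-away event is enough to exclude a candidate here, while a candidate with few matches is only ranked lower.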
This slide presents some statistics on the number of matchings we found and their distribution. On the left there is a boxplot summarizing the number of CDR users (for a better graphical representation the y axis is in logarithmic scale) having x matching events with FT users. On the right we have plotted the percentage of FT users that can be associated with x CDR users. Of course it is not possible to be completely sure about these users, and to deal with this kind of matching we use a probabilistic approach that will be illustrated in the next slide.
The probabilistic model tries to answer the question: given that the CDR user C2 has n events matching with FTa, how likely is it that the two users are the same? In other words, how likely is it that we actually de-anonymized the CDR user C2? We chose this approach not only because we had data from only one carrier, but also because the number of possible matchings (or matching events) is really high and, in the end, not all the CDR users can be excluded with respect to a given social user. So, given the user FTa (our social user), we consider a discrete random variable U having N_u values U_i (with i going from 1 to N_u) associated with the people who could be the user FTa. A subset of these values will be associated with our CDR users. theta_i is the probability that the two users (each from a different dataset) are the same person. We can then assume that the probability mass function associated with U can be modelled with a Dirichlet distribution in which each alpha_i is set to 1/N_u. So if our social user matches with 10 CDR users, each of them has the same probability, 1/10. If a CDR user falls under the exclusion condition illustrated in the previous slide, we set alpha_i = 0. We then count the number of times each CDR user produces events matching the events of the social user, denoted M, and, following Bayes' rule, update the posterior probability as the conditional probability of theta given M. In the end there will be a single most probable hypothesis, the maximum a posteriori (MAP) estimate theta_MAP.
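One possible reading of this update, under the standard Dirichlet-multinomial conjugacy, is that a prior Dirichlet(alpha) combined with match counts M gives the posterior Dirichlet(alpha + M), whose mean is E[theta_i | M] = (alpha_i + M_i) / sum_j (alpha_j + M_j). The sketch below computes these posterior means and takes the largest as a stand-in for the MAP hypothesis. That choice, the function name and the toy numbers are assumptions; the notes do not specify the exact estimator.

```python
def posterior_means(match_counts, excluded, n_u):
    """Posterior mean of theta_i for each CDR candidate of one social user.

    match_counts: {cdr_user: number of matching events M_i}
    excluded:     CDR users hit by the exclusion rule (alpha_i = 0)
    n_u:          assumed total number of people who could be the social user,
                  including people who are not in the CDR dataset
    """
    candidates = set(match_counts) | set(excluded)
    if n_u < len(candidates):
        raise ValueError("n_u must be at least the number of candidates")
    alpha = 1.0 / n_u
    # Every hypothesis keeps prior weight alpha except the excluded ones.
    alpha_total = alpha * (n_u - len(excluded))
    m_total = sum(m for u, m in match_counts.items() if u not in excluded)
    denom = alpha_total + m_total
    post = {}
    for user in candidates:
        if user in excluded:
            post[user] = 0.0
        else:
            post[user] = (alpha + match_counts.get(user, 0)) / denom
    return post


# Toy counts, invented for illustration.
post = posterior_means({"C1": 2, "C2": 4, "C4": 2}, excluded={"C3"}, n_u=10)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

In this sketch the probability mass not assigned to the listed candidates stays with the remaining hypotheses, which is how the model can account for people who use a different carrier.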
Having considered only CDR users with more than one match for each FT user, we compute the probability of matching a CDR user. Figure a) (left side) illustrates the results for one CDR-FT re-identification: the CDR user "0de7f" has a high probability and a large gap from the other CDR users. Even though we do not have ground-truth evidence, this large gap suggests that the social user 1278644 is the same person as the CDR user with which it has such a high probability. Figure b) shows the overall results: for each social user we compute the probability of the top-matching CDR user, and then count the number of CDR users that are re-identified with a given probability, in this case with probability larger than 0.1. We re-identified 260 social users, which is about one third of the social dataset we considered.
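The counting step behind figure b) can be sketched as a simple threshold on the top candidate. The 0.1 threshold comes from the notes; the function name, the reported gap to the runner-up and the toy probabilities are illustrative assumptions.

```python
def top_candidates(posteriors, threshold=0.1):
    """For each social user, keep the best CDR candidate if its probability
    exceeds the threshold, together with the gap to the runner-up.

    posteriors: {social_user: {cdr_user: probability}}
    """
    result = {}
    for social_user, probs in posteriors.items():
        if not probs:
            continue  # no candidate at all for this social user
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        best_user, best_p = ranked[0]
        runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_p > threshold:
            result[social_user] = (best_user, best_p, best_p - runner_up_p)
    return result


# Toy probabilities with hypothetical user ids, invented for illustration.
posteriors = {
    "s1": {"c1": 0.46, "c2": 0.24},
    "s2": {"c3": 0.05, "c4": 0.04},
    "s3": {},
}
reidentified = top_candidates(posteriors)
print(len(reidentified), reidentified)
```

Reporting the gap alongside the probability reflects the argument above: without ground truth, a large gap to the next candidate is the main evidence that the top match is meaningful.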
The model is based on a number of independence assumptions that can hardly be justified in the real world. Also, the random variables being used tend to have a large number of possible outcomes, so the overall probability remains low even after a large number of matching events.
Privacy concerns are the main factor preventing CDR data from being used in pervasive applications, but we believe that a viable approach could be a mechanism of differential anonymization that preserves privacy without destroying the utility of the dataset for the single aspect that is useful to the specific application. Correlation among datasets represents a big opportunity to enrich the information available to a pervasive application, toward the achievement of the pervasive computing vision.