• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Studying user footprints in different online social networks
 

Studying user footprints in different online social networks

on

  • 25,132 views

With the growing popularity and usage of online social media services, people now have accounts (some times several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. ...

With the growing popularity and usage of online social media services, people now have accounts (some times several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user using these social media services. Generating such digital footprints can be very useful for personalization, profile management, detecting malicious behavior of users. A very important application of analyzing users’ online digital footprints is to protect users from potential privacy and security risks arising from the huge publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context specific techniques (e.g. Jaro Winkler
similarity, Wordnet based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for
disambiguating profiles belonging to the same user from different social networks UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we
achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.

Statistics

Views

Total Views
25,132
Views on SlideShare
25,132
Embed Views
0

Actions

Likes
0
Downloads
3
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Studying user footprints in different online social networks Studying user footprints in different online social networks Presentation Transcript

    • Studying User Footprints in Different Online Social Networks Studying User Footprints in Different Online Social Networks Anshu Malhotra 1 , Luam Totti2 , Wagner Meira Jr.2 , Ponnurangam Kumaraguru 1 , Virg´ Almeida2 ılio 1 Indraprastha Institute of Information Technology New Delhi, India 2 Universidade Federal de Minas Gerais Belo Horizonte, Brazil August, 2012
    • Studying User Footprints in Different Online Social Networks Online Digital Footprints Users commonly register and access accounts (some times several) on multiple and diverse online services like Facebook, LinkedIn, Twitter and Youtube The set of all information related to the user, either provided directly or observed from the user’s interaction, is often called the user’s online digital footprint [6]
    • Studying User Footprints in Different Online Social Networks Linking User’s Online Accounts To create a user’s digital footprint the user’s multiple accounts must be known. We call this process linking user’s online accounts.
    • Studying User Footprints in Different Online Social Networks Linking User’s Online Accounts Linking user accounts from different services can serve several purposes [1, 4, 10, 11, 12, 13]: Centralize user information, enforcing data consistency and simplifying account maintenance Enrich recommendation systems Cross-system personalization Enable cross-system characterization and pattern analysis Assess and possibly prevent unwanted information leakage, thereby protecting users from various privacy and security threats
    • Studying User Footprints in Different Online Social Networks Main Challenges Users may choose different (and unrelated) usernames on different services, which may be unrelated to their real names [5] People with common names tend to have similar usernames [8, 17] Users may enter inconsistent and misleading information across their profiles [5], unintentionally or often deliberately in order to preserve privacy Heterogeneity in the network structure and profile fields among the services
    • Studying User Footprints in Different Online Social Networks Existing Techniques Various techniques have been proposed for unifying / disambiguating users’ various profiles across different online services: Techniques based on FOAF ontology & graphs [4, 9, 10, 11] Techniques based on user generated tags [5, 12, 13] Techniques based on usernames [8, 17] Techniques based on user profile attributes [1, 2, 3, 6, 7, 14]
    • Studying User Footprints in Different Online Social Networks Limitations of Existing Techniques Specificity to certain types of social networks Dependency on identifiers like email IDs, Instant Messenger IDs which might not be publicly available Use of simple text matching algorithms for comparing complex profile fields Manual and experimental assignment of weights and thresholds which can be subjective and not scalable
    • Studying User Footprints in Different Online Social Networks Limitations of Existing Techniques Use of small datasets (biggest being 5,000 users), wherein the data collection and evaluation was done manually in some approaches Real world evaluation has not been done for most of the techniques Computationally expensive
    • Studying User Footprints in Different Online Social Networks Major Contributions of our Work An scalable supervised learning approach for linking users’ accounts from different services Evaluation of different context specific similarity metrics for comparing different profile fields Results using a large dataset of linking user accounts across Twitter and LinkedIn Evaluation of the system’s performance for discovering new accounts for a given user.
    • Studying User Footprints in Different Online Social Networks User Profile Disambiguation - System Architecture Account Correlation Extractor: collates the dataset of user profiles known to be belonging to the same user across different social networks Profile Crawler crawls the public profile information from user accounts for these services A user’s Online Digital Footprints are generated after Feature Extraction and Selection Various Classifiers are trained for account pairs belonging to the same users and pairs belonging to different users, which are then used to disambiguate user profiles i.e. classify the given input profile pairs to be belonging to the same user or not
    • Studying User Footprints in Different Online Social Networks User Profile Disambiguation - System Architecture
    • Studying User Footprints in Different Online Social Networks Dataset Collection - 1st Stage The training and testing dataset was collated from two types of sources: Social Aggregators: Services that allow users to specify their multiple accounts in order to create an unified feed. We crawled 883,668 users from FriendFeed1 and 38,755 users from Profilactic.2 Social Graph:3 API that constructs and provides social interactions data, including information about users accounts on multiple services. Of the 14 million users collected, only 3.9 million had useful information. 1 http://friendfeed.com/ http://www.profilactic.com/ 3 http://code.google.com/apis/socialgraph/docs/ 2
    • Studying User Footprints in Different Online Social Networks Dataset Collection - Example twitter,justinbieber youtube,kidrauhl youtube,justinbieber twitter,aplusk facebook,ashton youtube,felipeneto youtube,felipenetovlog youtube,maspoxavida twitter,pecesiqueira youtube,jp youtube,MysteryGuitarMan youtube,descealetra twitter,cauemoura twitter,jcillpam11six facebook,jeremiahcillpam11six twitter,cjdances facebook,cjperrydances twitter,sirhilton facebook,richardsrueda
    • Studying User Footprints in Different Online Social Networks Dataset Collection - 2nd Stage Publicly available information on each account was then collected from each service Four services were initially chosen for the analysis Twitter, LinkedIn, YouTube and Flickr Six profile fields were chosen for the analysis UserID, display name, description, location, connections, image
    • Studying User Footprints in Different Online Social Networks Dataset Collection Due to high percentage of missing fields, YouTube and Flickr were excluded All further analysis in this work refers only to Twitter and LinkedIn accounts 80 70 Missing (%) 60 Twitter LinkedIn YouTube Flickr 50 40 30 20 10 0 Name Location Description Image
    • Studying User Footprints in Different Online Social Networks Profile Similarity A profile may be seen as a N-dimensional vector, where each component is a profile field [14] Therefore, comparing profiles can be done component-wise However, components are of very distinct nature and hence demand different similarity methods for comparison In this work we evaluated different approaches for comparing each profile field
    • Studying User Footprints in Different Online Social Networks Similarity Metrics UserID & Display Name: Jaro-Winkler[15] distance (JW ) is best suited for similarity between small strings, hence it was used for both these fields Description (desc): The fields had punctuation and stop words removed. The words were then lemmatized and converted to lower case to produce the final token set. TF-IDF: Cosine similarity between the two token sets using their tf-idf vector space representation Jaccard (Jacc): Jaccard’s similarity score between the two token sets Ontology (Ont): Wu-Palmer [16] similarity distance between the Wordnet based ontologies of each description field
    • Studying User Footprints in Different Online Social Networks Similarity Metrics Location (Loc): Tokens were extracted from the location fields of both profiles by removing the punctuations and converting them to lower case. Sub-string (Substr ): Normalized score of number of tokens from one field value present as a substring of the other field value Geo-distance (Geo): Euclidean Distance between the two locations using their latitude and longitudes (Google Maps GeoCoding API4 ). Jaro-Winkler distance Jaccard’s score 4 https://developers.google.com/maps
    • Studying User Footprints in Different Online Social Networks Similarity Metrics Profile Image (img ): The profile image was downloaded and stored locally Each image was then scaled down to 48 X 48 pixels using cubic spline interpolation Each image was then converted to gray scale. Each image could then be represented as a vector of values from 0 to 255 to which functions for computing Mean Square Error (mse), Peak Signal-to-Noise Ratio (psnr), and Levenshtein (ls) were applied to quantify profile image similarity
    • Studying User Footprints in Different Online Social Networks Similarity Metrics Number Of Connections (conn): For Twitter the number of connections of a user u is the number of users that u follows. For LinkedIn is the number of users in the private network of user u The number of connections in different services can assume different ranges, with different meanings Normalized (norm): Each connection value c was normalized to the range [0..1] using the smallest and greatest connection values observed in each service. The similarity was then taken as the unsigned difference between the two values. Class: Each value was assigned a (equally sized) class denoting how big it was. Five classes were used in this work (0-4). The similarity was taken as the different between the two classes.
    • Studying User Footprints in Different Online Social Networks Evaluation Experiments Feature Analysis: To analyze the discriminative capacity of different profile attributes and similarity metrics for disambiguating user profiles Matching Profiles: To test the effectiveness of supervised learning approaches for classifying two profiles as belonging the same user Discovering Candidate Profiles: To evaluate the performance of our framework for discovering new accounts for a given known user The analysis was done using a dataset of account pairs of 29,129 unique users
    • Studying User Footprints in Different Online Social Networks Feature Analysis UserID IG Relief MDL Gini Name JW JW Jacc Description TF-IDF Ont Norm Connections Class 0.548 0.434 0.379 0.151 0.812 0.521 0.562 0.217 0.286 0.134 0.274 0.084 0.323 0.180 0.300 0.092 0.161 0.113 0.188 0.051 0.000 0.002 -0.006 0.000 0.009 0.095 0.006 0.003 Location Image JW IG Relief MDL Gini Jacc Substr Geo MSE PSNR LS 0.232 0.108 0.158 0.067 0.337 0.041 0.233 0.098 0.350 0.039 0.270 0.102 0.520 0.227 0.488 0.146 0.183 0.157 0.205 0.051 0.184 0.158 0.205 0.051 0.215 0.188 0.227 0.061 Table : Discriminative capacity of each pair < feature, metric > according to four different approaches.
    • Studying User Footprints in Different Online Social Networks Feature Analysis - Box Plots userid jw name jw location jw 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0 0 Match Non Match 0 Match location jaccard Non Match Match location substring Non Match location geo 300 1 1 0.8 0.8 0.6 0.6 0.4 0.4 100 0.2 0.2 50 0 0 Match Non Match 250 200 150 0 Match Non Match Match Non Match Figure : Box plots for the UserID, Name and Location features.
    • Studying User Footprints in Different Online Social Networks Feature Analysis - Box Plots description jaccard description tf-idf 0.5 1 0.4 0.8 0.3 0.6 0.2 0.4 0.1 description ontology 0.2 0 0.1 0 0 Match Non Match Match connections class Non Match Match image psnr 5 30 4 image levenshtein 25 3 Non Match 20 1 15 2 0.9 10 1 5 0 0 Match Non Match 0.8 Match Non Match Match Non Match Figure : Box plots for the Description, Connections and Image features.
    • Studying User Footprints in Different Online Social Networks Matching User Profiles The most promising similarity metrics and features were used to train classifiers for the task of detecting profiles that belong to the same user Similarity Vector: E.g.: <useridjw , descjaccard , locgeo > Training Set: Positive Examples: Similarity vectors for the accounts pairs of the dataset Negative Examples: Equal number of negative examples synthesized by randomly pairing accounts from different users and calculating their similarity vectors A total of 58,258 training instances
    • Studying User Footprints in Different Online Social Networks Matching User Profiles After training the classifiers were tested with Twitter-LinkedIn profile pairs to be classified as a “Match” or a “Non Match” A “Match” means that the two given input profiles belong to the same user, while “Non Match” means they don’t Classifiers used: Na¨ Bayes ıve Decision Tree SVM kNN
    • Studying User Footprints in Different Online Social Networks Matching User Profiles Results were generated for all possible combinations of profile features and similarity metrics using 10-fold cross validation. As shown below, we achieved accuracy, precision and recall as 98%, 99% and 96% respectively for the best feature set Na¨ Bayes ıve Decision Tree SVM kNN Accuracy 0.980 0.965 0.972 0.898 Precision 0.996 0.994 0.988 0.998 Recall 0.964 0.936 0.956 0.798 F1 0.980 0.964 0.971 0.887 Table : Results for multiple classifiers using the feature set {namejw , useridjw , locgeo , desctfidf , imgls , connnorm }.
    • Studying User Footprints in Different Online Social Networks Discovering New User Profiles So the results are very good using an static dataset, but what if we don’t have candidates to match to a known user profile? A system was developed for retrieving profile candidates of possible matches for a known account from some other service. A part (one fifth) of the true positive data was reserved to be the testing set A Na¨ Bayes classifier was trained with the remaining set ıve and was modified to return the probability of the two input profiles of belonging to the same user
    • Studying User Footprints in Different Online Social Networks Discovering Candidate User Profiles We query Twitter’s API using LinkedIn’s display name for each profile pair from the testing dataset For each of the profiles returned from the Twitter API, we compute the similarity vector with the LinkedIn profile of the user We next used the trained classifier to return the probability of each of these profiles of belonging to the same user We rank the Twitter profiles in decreasing order of their probabilities Ideally the correct Twitter profile (of the profile pair from the testing set) should be at the top of this ranking
    • Studying User Footprints in Different Online Social Networks Discovering Candidate User Profiles 80 Profiles Found (%) 75 70 65 60 55 50 All features Best Features 45 40 5 10 15 20 Rank position Figure : Relation between the position in the rank r and the percentage of times the right profile is found in a position lower or equal to r .
    • Studying User Footprints in Different Online Social Networks Discovering Candidate User Profiles In 64% of the cases the right profile was found in the first position of the rank when using all features, while this value was 49% for the set of the best features This means that using all features instead of only the best can help to disambiguate between the possible candidates. In 75% of the times the right profile was in the first 3 positions of the rank This suggests the system can be used in a semi-supervised manner
    • Studying User Footprints in Different Online Social Networks Conclusions & Results Applied automated techniques to identify accounts beloging to a same user in different online services Only publicly available information was extracted and used Proposed and evaluated multiple similarity metrics, comparing their discriminative capacity for the task profile linking UserID and Name, when compared using the Jaro-Winkler metric, were the most discriminative features
    • Studying User Footprints in Different Online Social Networks Conclusions & Results For the best set of features and similarity metrics we achieved accuracy, precision and recall as 98%, 99% and 96% respectively Evaluation of the system’s performance for discovering the user’s profile on Twitter given his display name on LinkedIn Using all features instead of only the most discriminative ones has shown better results The system may be used to match user profile automatically with 64% accuracy or in a semi supervised manner to narrow down candidate profiles
    • Studying User Footprints in Different Online Social Networks Future Work Incorporate more profile fields Generalize our model to include other social networks Adapt our system to handle missing and incorrect profile attributes
    • Studying User Footprints in Different Online Social Networks For any further information, please write to pk@iiitd.ac.in precog.iiitd.edu.in
    • Studying User Footprints in Different Online Social Networks Bibliography I Carmagnola, F., and Cena, F. User identification for cross-system personalisation. Inf. Sci. 179, 1-2 (Jan. 2009), 16–32. Carmagnola, F., Osborne, F., and Torre, I. Cross-systems identification of users in the social web. In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134. Carmagnola, F., Osborne, F., and Torre, I. User data distributed on the social web: how to identify users on different social systems and collecting data about them. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec ’10, ACM, pp. 9–15. Golbeck, J., and Rothstein, M. Linking social networks on the web with foaf: a semantic web case study. AAAI’08, pp. 1138–1143. Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K. Identifying users across social tagging systems, 2011.
    • Studying User Footprints in Different Online Social Networks Bibliography II Irani, D., Webb, S., Li, K., and Pu, C. Large online social footprints–an emerging threat. In CSE ’09 (aug. 2009), vol. 3, pp. 271 –276. Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E. Detecting social network profile cloning. In PerCom (march 2011), pp. 295 –300. ˆ Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P. How unique and traceable are usernames? In PETS (2011), pp. 1–17. Rowe, M. Applying semantic social graphs to disambiguate identity references. In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009), pp. 461–475. Rowe, M. Interlinking distributed social graphs. In LDOW2009 (April,Spring 2009).
    • Studying User Footprints in Different Online Social Networks Bibliography III Rowe, M., and Ciravegna, F. Harnessing the social web: The science of identity disambiguation. In Web Science Conference (2010). Szomszor, M., Alani, H., Cantador, I., O’Hara, K., and Shadbolt, N. Semantic modelling of user interests based on cross-folksonomy analysis. ISWC ’08, pp. 632–648. Szomszor, M. N., Cantador, I., and Alani, H. Correlating user profiles from multiple folksonomies. HT ’08, pp. 33–42. Vosecky, J., Hong, D., and Shen, V. Y. User identification across multiple social networks. In NDT’09 (July 2009). Winkler, W. E. String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage. In Proceedings of the Section on Survey Research (1990), pp. 354–359.
    • Studying User Footprints in Different Online Social Networks Bibliography IV Wu, Z., and Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 1994), ACL ’94, Association for Computational Linguistics, pp. 133–138. Zafarani, R., and Liu, H. Connecting corresponding identities across communities, 2009.