
Perceived versus Actual Predictability of Personal Information in Social Networks



This paper looks at the problem of privacy in the context of Online Social Networks (OSNs). In particular, it examines the predictability of different types of personal information based on OSN data and compares it to users' perceptions about the disclosure of their information. To this end, a real-life dataset was compiled. It consists of the Facebook data (images, posts and likes) of 170 people, along with their replies to a survey that addresses both their personal information and their perceptions about the sensitivity and the predictability of different types of information. Importantly, we evaluate several learning techniques for the prediction of user attributes based on their OSN data. Our analysis shows that users' perceptions about the disclosure of specific types of information are often incorrect. For instance, the predictability of their political beliefs and employment status appears to be higher than they tend to believe. Interestingly, information that users characterize as more sensitive also appears to be more easily predictable than they think, and vice versa (i.e. information characterized as relatively less sensitive is less easily predictable than users might have thought).

Published in: Internet

  1. Perceived versus Actual Predictability of Personal Information in Social Networks
     Eleftherios (Lefteris) Spyromitros-Xioufis (1), Georgios Petkos (1), Symeon Papadopoulos (1), Rob Heyman (2), Yiannis Kompatsiaris (1)
     (1) Center for Research and Technology Hellas – Information Technologies Institute (CERTH-ITI)
     (2) iMinds-SMIT, Vrije Universiteit Brussel, Brussels, Belgium
     INSCI 2016, Sep 12-14, 2016, Florence, Italy
  2. Disclosure of Personal Information in OSNs
     • Online Social Networks (OSNs) have had a transformative impact!
       • People use them for communication, as a news source, to do business, ...
     • However, participation in OSNs comes at a price!
       • User-related data is shared with: a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks)
       • Disclosure of specific types of data (e.g. gender, age, ethnicity, political or religious beliefs, sexual preferences, financial status) has implications, e.g. unjustified discrimination in personnel selection or loan approval
       • Information need not be explicitly disclosed!
         • Several types of personal information can be accurately inferred from implicit cues (e.g. Facebook likes) using machine learning!
  3. Inferring Personal Information
     • Supervised learning algorithms
       • Learn a mapping (model) from inputs $\mathbf{x}_i$ to outputs $y_i$ by analyzing a set of training examples $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
       • In this case:
         • $y_i$ corresponds to a personal user attribute, e.g. sexual orientation
         • $\mathbf{x}_i$ corresponds to a set of predictive attributes or features, e.g. user likes
       • Using this mapping, inferences can be made for new users!
     • Some previous results
       • Kosinski et al. [1]: likes features (SVD) + logistic regression
         • Highly accurate inferences of ethnicity, gender, sexual orientation, etc.
       • Schwartz et al. [2]: status updates (PCA) + linear SVM
         • Highly accurate inference of gender
     [1] Kosinski et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
     [2] Schwartz et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 2013.
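The mapping from like features to a binary user attribute can be illustrated with a minimal sketch. It is not the authors' or Kosinski et al.'s pipeline: the SVD step is omitted, the toy data is invented, and the function names `fit_logistic` and `predict_proba` are hypothetical. It fits an L2-regularized logistic regression by plain gradient descent using only the Python standard library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit weights w and bias b on training pairs (x_i, y_i) by batch gradient descent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x_i, y_i in zip(X, y):
            # Prediction error for one training example.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x_i))) - y_i
            for j, xj in enumerate(x_i):
                grad_w[j] += err * xj
            grad_b += err
        # Average gradient plus L2 penalty on the weights (bias is not penalized).
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability that the attribute is 1 for a new user's feature vector x."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Invented toy data: each row is a binary "likes" vector over three pages.
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
model = fit_logistic(X, y)
print(predict_proba(model, [1, 0, 0]), predict_proba(model, [0, 0, 1]))
```

A real experiment would use a tested library implementation and far more users and features; the point here is only the shape of the learning problem.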
  4. Inferred Information & Privacy in OSNs
     • The study of user awareness with regard to inferred information has been largely neglected by social research on OSN privacy
     • Privacy is usually presented as a question of giving access to, or communicating, personal information to a particular party
       • E.g. Westin's [1] definition of privacy: “The claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.”
     • However, access control is non-existent for inferred information:
       a) Users are unaware of the inferences being made
       b) Users have no control over the logic behind them
     • Aim of our work:
       • Investigate if and how users intuitively grasp what can be inferred from their disclosed data!
     [1] Alan Westin. Privacy and freedom. Bodley Head, London, 1970.
  5. Main Research Questions
     • Our study attempts to answer the following questions:
       1. Predictability
          • How predictable are different types of personal information, based on users' OSN data?
       2. Actual vs perceived predictability
          • How realistic are user perceptions about the predictability of their personal information?
       3. Predictability vs sensitivity
          • What is the relationship between perceived sensitivity and predictability of personal information?
     • Previous work has focused mainly on Q1
     • We address Q1 using a variety of data and methods, and additionally address Q2 and Q3
  6. What data is needed for this study?
     • We collected 3 types of data about 170 Facebook users:
       1. OSN data: likes, posts, images
          • Collected through a test Facebook application (Databait, developed within the USEMP FP7 project)
       2. Answers to questions about 96 personal attributes, organized into 9 categories (disclosure dimensions)
          • E.g. health factors, sexual orientation, income, political attitude, etc.
       3. Answers to questions related to their perceptions about the predictability and sensitivity of the 9 disclosure dimensions
     • What is the purpose of each data type?
       • 1 & 2 allow assessing the actual predictability of personal information
         • They serve as training sets for supervised learning algorithms
       • 3 facilitates a comparison between actual predictability and perceived predictability/sensitivity of personal information
  7. Example from the questionnaire
     • What is your sexual orientation?
       • Ground truth!
     • Do you think the information on your Facebook profile reveals your sexual orientation? Either because you yourself have put it online, or it could be inferred from a combination of posts.
       • Measures perceived predictability
     • How sensitive do you find the information you had to reveal about your sexual orientation in the previous section? (1 = not sensitive at all, 7 = very sensitive)
       • Measures perceived sensitivity

     Response     | No. of participants
     heterosexual | 147
     homosexual   | 14
     bisexual     | 7
     n/a          | 2

     Response | No. of participants
     yes      | 134
     no       | 33
     n/a      | 3
  8. Predictive Attributes Extracted from OSN Data
     • likes: binary vector denoting presence/absence of a like (#3.6K)
     • likesCats: histogram of like category frequencies (#191)
     • likesTerms: Bag-of-Words (BoW) of terms in the description, title and about sections of likes (#62.5K)
     • msgTerms: BoW vector of terms in user posts (#25K)
     • lda-t: distribution of topics in the textual contents of both likes (description, title and about sections) and posts
       • Latent Dirichlet Allocation with t = 20, 30, 50, 100
     • visual: concepts depicted in user images (#11.9K)
       • Detected using a CNN, top 12 concepts per image, 3 variants:
         • visual-bin: hard 0/1 encoding
         • visual-freq: concept frequency histogram
         • visual-conf: sum of detection scores across all images
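The encodings listed on the slide above can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' extraction code: the helper names, the input layout (one dict of concept scores per image) and the sample values are assumptions.

```python
from collections import Counter

def likes_binary(user_likes, vocabulary):
    """'likes': 1 if the user liked the page, else 0, over a fixed page vocabulary."""
    liked = set(user_likes)
    return [1 if page in liked else 0 for page in vocabulary]

def category_histogram(user_like_categories, categories):
    """'likesCats': how many of the user's likes fall into each category."""
    counts = Counter(user_like_categories)
    return [counts[c] for c in categories]

def visual_features(image_detections, concepts):
    """Three 'visual' variants from per-image detections (dict: concept -> score)."""
    freq, conf = Counter(), Counter()
    for detections in image_detections:
        for concept, score in detections.items():
            freq[concept] += 1       # number of images showing the concept
            conf[concept] += score   # summed detection scores
    visual_bin = [1 if freq[c] > 0 else 0 for c in concepts]
    visual_freq = [freq[c] for c in concepts]
    visual_conf = [conf[c] for c in concepts]
    return visual_bin, visual_freq, visual_conf

# Invented sample: two images, three known concepts.
detections = [{"dog": 0.5, "beach": 0.25}, {"dog": 0.25}]
print(visual_features(detections, ["dog", "beach", "car"]))
```

The bag-of-words and LDA features would follow the same pattern, with terms or topics in place of pages and concepts.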
  9. Experimental Setup
     • Evaluation method: repeated random sub-sampling
       • Data split randomly $n = 10$ times into train (67%) / test (33%)
       • Model fit on train / accuracy of inferences assessed on test
       • 96 questions (user attributes) were considered
     • Evaluation measure: area under the ROC curve (AUC)
       • Appropriate for imbalanced classes
     • Classification algorithms
       • Baseline: $k$-nearest neighbors, decision tree, Naïve Bayes
       • SoA: AdaBoost, random forest, regularized logistic regression
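A minimal sketch of this evaluation protocol, using only the standard library. The function names are hypothetical, the splits are not stratified (the slides do not say whether the authors stratified), and AUC is computed by direct pairwise comparison, which is fine for a dataset of this size.

```python
import random

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_subsampling(n_examples, n_repeats=10, train_fraction=0.67, seed=0):
    """Yield (train_indices, test_indices) for each random train/test split."""
    rng = random.Random(seed)
    n_train = round(n_examples * train_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

# With 170 users, each split has 114 training and 56 test users.
for train, test in repeated_subsampling(170):
    pass  # fit a model on `train`, score `test`, collect auc(...) per split

print(auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]))  # 0.75
```

Averaging the per-split AUC values over the 10 repetitions gives one score per attribute and classifier.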
  10. Results 1: Evaluating Classifiers
      [Bar chart, score axis from 0.45 to 0.75. Classifiers compared: tree, nb, knn, adaboost, rf, logistic. Attributes shown: bmiclass, healthstatus, smoking behavior, drinking behavior, income, cannabis, employment, sexual orientation.]
  11. Results 2: Evaluating Features
      [Bar chart, score axis from 0.50 to 0.58. Classifiers compared: rf, logistic. Feature sets shown: LDA-20, LDA-30, LDA-50, LDA-100, likesCats, msgTerms, likesTerms, likes, visual-bin, visual-conf, visual-freq.]
  12. Results 3: Combining Features
      [Bar chart, score axis from 0.53 to 0.58. Single feature sets: visual-conf, likesCats, msgTerms, likesTerms, LDA-30, likes. Pairwise combinations: visual-conf/likesCats, likesCats/likes, visual-conf/msgTerms, likesTerms/likesCats, msgTerms/likesTerms, msgTerms/likesCats, visual-conf/likes, visual-conf/likesTerms, LDA-30/msgTerms, msgTerms/likes, likesTerms/likes, LDA-30/likesTerms, LDA-30/likesCats, visual-conf/LDA-30, LDA-30/likes. Series label: nolatefusion.]
  13. Results 4: Best Performance per Attribute
      [Bar chart, score axis from 0.50 to 1.00, one bar per user attribute, grouped by disclosure dimension: 1 demographics, 2 employment/income, 3 relationship/living, 4 religion, 5 personality, 6 sexual orientation, 7 political ideology, 8 health factors, 10 consumer profile.]
  14. Ranking of Dimensions

      Rank | Perceived predictability                 | Actual predictability                    | Shift | Actual predictability according to [1]
      1    | Demographics                             | Demographics                             | -     | Demographics
      2    | Relationship status and living condition | Political views                          | +3    | Political views
      3    | Sexual orientation                       | Sexual orientation                       | -     | Religious views
      4    | Consumer profile                         | Employment/Income                        | +4    | Sexual orientation
      5    | Political views                          | Consumer profile                         | -1    | Health status
      6    | Personality traits                       | Relationship status and living condition | -4    | Relationship status and living condition
      7    | Religious views                          | Religious views                          | -     |
      8    | Employment/Income                        | Health status                            | +1    |
      9    | Health status                            | Personality traits                       | -3    |

      [1] Kosinski et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
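The signed numbers next to the actual-predictability ranking are consistent with "perceived rank minus actual rank" for each dimension: a positive value means the dimension is more predictable than users believe. The short sketch below recomputes them from the two rankings on the slide; the function name `rank_shifts` is hypothetical.

```python
perceived = [
    "Demographics", "Relationship status and living condition", "Sexual orientation",
    "Consumer profile", "Political views", "Personality traits",
    "Religious views", "Employment/Income", "Health status",
]
actual = [
    "Demographics", "Political views", "Sexual orientation",
    "Employment/Income", "Consumer profile", "Relationship status and living condition",
    "Religious views", "Health status", "Personality traits",
]

def rank_shifts(perceived, actual):
    """Return perceived rank minus actual rank per dimension (ranks start at 1)."""
    perceived_rank = {dim: i for i, dim in enumerate(perceived, start=1)}
    return {dim: perceived_rank[dim] - i for i, dim in enumerate(actual, start=1)}

for dim, shift in rank_shifts(perceived, actual).items():
    print(f"{dim}: {shift:+d}")
```

Political views (+3) and Employment/Income (+4) show the largest underestimation, matching the abstract's remark about political beliefs and employment status.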
  15. Perceived/Actual Predictability vs Sensitivity
  16. Conclusions & Future Work
      • Conclusions
        • Users hold both correct and incorrect perceptions about predictability
        • Predictability of sensitive information is underestimated
        • Sophisticated privacy assistance tools are needed
          • Support users in managing disclosure of personal information
      • Databait: a privacy assistance tool (still in beta mode)
  17. Thank you!
      • Resources: Code/models, Databait
      • Contact us: @espyromi @sympap @kompats