Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Predicting user demographics in social networks - Invited Talk at University of Warwick

1,018 views

Published on

Automatically inferring user demographics in social networks is useful for both social science research and a range of downstream applications in marketing and politics. Our main hypothesis is that language use in social networks is indicative of user attributes. This talk presents recent work on inferring a new set of socioeconomic attributes, i.e. occupational class, income and socioeconomic class. We define a predictive task for each attribute where user-generated content is utilised to train supervised non-linear methods for classification and regression, i.e. Gaussian Processes. We show that our models achieve strong predictive accuracy in all of the three demographics while our analysis sheds light to factors that differentiate users between occupations, income level and socioeconomic classes.

Published in: Science
  • who will win this game? get free picks and predictions. ▲▲▲ http://ishbv.com/zcodesys/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Predicting user demographics in social networks - Invited Talk at University of Warwick

  1. 1. Predicting User Demographics in Social Networks Nikolaos Aletras Amazon aletras@amazon.com
  2. 2. introduction
  3. 3. Motivation • The World Wide Web, social networks and smart devices • Vast amounts of user-generated content available online such as social media posts, photos, query logs etc. • Big Data and Data Science era • Can we use these data for social good?
  4. 4. Why is it useful? • Enable large scale studies in social sciences • analyse human behaviour on a large scale • computational social science • Tackle real world problems • education • health intervention/surveillance • economic development
  5. 5. Why is it useful? • More intelligent recommender systems • promotion of public policies • targeted online political campaigns • targeted advertising • Integration to other predictive tasks • voting intention • sentiment analysis • health (e.g. infectious disease outbreak prediction)
  6. 6. What is it? • Utilise user-generated content to infer attributes (user profiling) • Examples of user attributes : • location (Cheng et al., 2010), gender (Burger et al., 2011), impact (Lampos et al., 2014), age (Rao et al., 2010) • Such tasks are (usually) framed as predictive tasks, i.e. classification or regression
  7. 7. Socioeconomic attributes • Social status influences language use (Bernstein, 1960; Bernstein, 2003; Labov, 2006) • Hypothesis: • Language use in Twitter can be indicative of user demographics • Attributes: • Occupational class • Income • Socioeconomic class
  8. 8. data
  9. 9. SOC taxonomy • Standard Occupational Classification (SOC): • Taxonomy of jobs, grouped by skill requirements • Developed by the UK Office for National Statistics
  10. 10. SOC taxonomy • C1 Corporate Managers and Directors —> chief executive, bank manager • C2 Professional Occupations —> mechanical engineer, pediatrist, research scientist • C3 Associate Professional and Technical Occupations —> system administrator, dispensing optician • C4 Administrative and Secretarial Occupations —> legal clerk, company secretary • C5 Skilled Trades Occupations —> electrical fitter, tailor • C6 Caring, Leisure, Other Service Occupations —> school assistant, hairdresser • C7 Sales and Customer Service Occupations —> sales assistant, telephonist • C8 Process, Plant and Machine Operatives —> factory worker, van driver • C9 Elementary Occupations —> shelf stacker, bartender
  11. 11. Map users to their socioeconomic attributes • Use job titles from SOC to retrieve Twitter accounts • Read the profile info and/or tweets • Remove organisations/companies • Keep only users that annotators agree they belong to a specific class
  12. 12. Map users to their socioeconomic attributes • Use the SOC class as a proxy to find user’s mean income and socioeconomic class • Annual Survey of Hours & Earnings + SOC —> Mean income in £ • Production manager (£50,952/year) • Sales Supervisor (£18,383/year)
  13. 13. Map users to their socioeconomic attributes • Use the SOC class as a proxy to find user’s mean income and socioeconomic class • Socioeconomic coding + SOC —> Socioeconomic class (upper, medium, lower) • Bank manager —> upper • Government clerk —> medium • Factory cleaner —> lower
  14. 14. Data sets • Data Set 1 • 5,191 Twitter users - SOC class - Mean income • 10M tweets (maximum 200 tweets/user) • Publicly available • Data Set 2 • 1,342 Twitter users - SOC class - Socioeconomic class • 2M tweets • Publicly available
  15. 15. models
  16. 16. Supervised Models • Traditional linear models (e.g. logistic regression) • Suport Vector Machines (SVMs) • Gaussian Processes (GPs) • Deep Neural Nets
  17. 17. Supervised Models • We look for two main characteristics: • Model non-linearities • Interpretability
  18. 18. Gaussian Processes (GPs) • Bayesian • kernel-based • non-parametric • non-linear • modelling uncertainty • interpretable models
  19. 19. Gaussian Processes (GPs) x 2 Rd f : Rd ! R f(x) ⇠ GP(m(x), k(x, x0 )) d-dimensional input data learn a function mean function drawn on inputs covariance function (kernel) drawn on pairs of inputs
  20. 20. Gaussian Processes (GPs) f(x) ⇠ GP(m(x), k(x, x0 )) covariance function (kernel) Automatical Relevance Determination (ARD) kernel (Neil, 1996) kard(x, x0 ) = 2 exp h dX i (xi x0 i)2 2l2 i ARD kernel learns feature importance using the lengthscales l —> most discriminative between classes
  21. 21. GPs for Classification • Pass function through a logistic function to squash the input from to obtain probability:( 1, +1) ⇡(x) = p(yi = 1kfi) • Inference using Expectation Propagation
  22. 22. Gaussian Processes (GPs) • More information about GPs: • http://www.gaussianprocess.org/
  23. 23. Predicting the occupational class Preotiuc-Pietro et al. (2015, ACL)
  24. 24. Predicting the socioeconomic class Users Feature vectors GPs SOC class C1 - C9 • 5,191 users mapped to a SOC class • ~10M tweets
  25. 25. Predicting the occupational class 46.9 51.7 52.7 0% 10% 20% 30% 40% C1 C2 C3 C4 C5 C6 C7 C8 C9 Distribution of users in the 9 SOC classes
  26. 26. Predicting the occupational class C2 Professional Occupations
  27. 27. Predicting the occupational class
  28. 28. Predicting the occupational class: Features • User profile (18) • number of followers/friends/listings/tweets • proportion of retweets/hashtags/@-replies/links • average of tweets a day/retweets per tweet • Topics - Word Clusters (200) • Spectral clustering on a word similarity matrix. • Words represented as Word2Vec embeddings (Mikolov et al., 2013). • Similarity is computed as the cosine of the word embeddings.
  29. 29. Predicting the occupational class: Performance Accuracy(%) 0 15 30 45 60 Feature Type User Profile Word2Vec Clusters 52.7 34.2 51.7 31.5 46.9 34 Logistic Regression SVM (RBF) Gaussian Process (ARD)
  30. 30. Predicting the occupational class: Topics Rank Label Topic 1 Arts art, design, print, collection, poster, painting, custom, logo, printing drawing 2 Health risk, cancer, mental, stress, patients, treatment, surgery, disease, drugs, doctor 3 Beauty Care beauty, natural, dry, skin, massage, plastic, spray, facial, treatments, soap 4 Higher Education students, research, board, student, college, education, library, schools, teaching, teachers 5 Software Engineering service, data, system, services, access, security, development, software, testing, standard Most predictive Topics given by ARD ranking
  31. 31. Predicting the occupational class: Topics Rank Label Topic 7 Football van, foster, cole, winger, terry, reckons, youngster, rooney, fielding, kenny 8 Corporate patent, industry, reports, global, survey, leading, firm, 2015, innovation, financial 9 Cooking recipe, meat, salad, egg, soup, sauce, beef, served, pork, rice 12 Elongated Words wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo 16 Politics human, culture, justice, religion, democracy, religious, humanity, tradition, ancient, racism Most predictive Topics given by ARD ranking
  32. 32. Predicting the occupational class: Feature analysis 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Higher Education (#21) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent in a class C1-C9 —> CDF line closer to the bottom-right corner
  33. 33. Predicting the occupational class: Feature analysis 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Arts (#116) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent in a class C1-C9 —> CDF line closer to the bottom-right corner
  34. 34. Predicting the occupational class: Feature analysis 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Elongated Words (#164) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent in a class C1-C9 —> CDF line closer to the bottom-right corner
  35. 35. Predicting the occupational class: Feature analysis Jensen-Shannon Divergence between topic distributions across classes
  36. 36. Predicting the income Preotiuc-Pietro et al. (2015, PloS ONE)
  37. 37. Predicting the income Users Feature vectors GPs £ • 5,191 Twitter users mapped to an average income in GBP (£) • ~10M tweets
  38. 38. Predicting the income 10k 30k 50k 100k 0 200 400 600 800 1000 Yearly income (£) No.Users
  39. 39. Predicting the income: Features • Profile (8): #followers, #followees, times listed etc. • Shallow textual features (10): proportion of hashtags, @-replies etc. • Inferred psycho-demographic features (15): gender, age, education level, religion etc. • Emotions (9): joy, anger, fear, disgust etc. • Word Clusters - Topics (200): Word Embeddings —> Similarity matrix —> Spectral Clustering
  40. 40. Predicting the income: Performance MAE 9000 9750 10500 11250 12000 Feature Type £9,535£9,621 £11,456 £10,980 £10,110 £11,291 Profile Demo Emotion Shallow Topics All Features Mean Absolute Error (MAE) of income inference
  41. 41. Predicting the income: Performance 0.21 0.28 0.22 0.27 0.20 0.50 0.51 0.33 0.26 0.32 0.36 0.26 0.61 0.61 0.37 0.36 0.33 0.37 0.36 0.61 0.63 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 Profile Psycho-Demo Personality Emotions Shallow Topics All Features LR SVM-RBF GP Correlation between actual and predicted income
  42. 42. Predicting the Income: Feature analysis e1: positive (l=46.27) e2: neutral (l=57.64) e3: negative(l=76.34) e4: joy (l=36.37) e5: sadness (l=67.05) e6: disgust (l=116.66) e7: anger (l=95.50) e8: surprise (l=83.61) e9: fear (l=31.74) 28000 35000 42000 28000 35000 42000 28000 35000 42000 0.1 0.2 0.3 0.4 0.5 0.4 0.5 0.6 0.7 0.8 0.9 0.05 0.10 0.15 0.20 0.5 0.6 0.7 0.8 0.05 0.10 0.010 0.015 0.020 0.025 0.030 0.01 0.02 0.03 0.04 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 Feature value Income Linear vs GP fit Relation of income and emotion
  43. 43. Predicting the Income: Feature analysis Linear vs GP fit Topic 107 (Justice) Topic 124 (Corporate 1) Topic 139 (Politics) Topic 163 (NGOs) Topic 196 (Web analytics/Surveys) Topic 99 (Swearing) 30000 40000 50000 30000 40000 50000 0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.000 0.025 0.050 0.075 0.000 0.025 0.050 0.075 0.100 0.00 0.01 0.02 0.03 0.04 0.00 0.03 0.06 0.09 0.12 Feature value Income Relation of income and topics
  44. 44. Predicting the Income: Feature analysis Linear vs GP fit u1: No.followers (l=47.76) u2: No.friends (l=84.48) u3: No.listings (l=2.65) u4: Foll/fr.ratio (l=5.16) u5: No.favs (l=96.41) u6: Tw/day (40.96) u7: No.tweets (l=15.94) u8:English Tw. (l=3.12) 28000 36000 44000 28000 36000 44000 0 2000 4000 0 500 1000 1500 2000 2500 0 50 100 150 2 4 6 0 1000 2000 3000 0 5 10 15 20 0 10000 20000 30000 0.25 0.50 0.75 1.00 Feature value Income Relation of income and profile
  45. 45. Predicting the Income: Feature analysis Relation of income and psycho-demographic features ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30023 36408 30670 32804 34949 32154 24944 32621 27792 35028 31880 34627 32029 32985 Income: Above Average Income: Below Average Religion: Unaffiliated Religion: Christian Gender: Male Gender: Female Ethnicity: Caucassian Ethnicity: African American Education: Degree Education: High School Age: > 35 Age: 30−35 Age: 25−30 Age: < 25 20000 25000 30000 35000 40000 Mean group income (95% CI)
  46. 46. Predicting the socioeconomic class Lampos et al. (2016, ECIR)
  47. 47. Predicting the socioeconomic class Users Feature vectors GPs upper medium lower • 1,342 Twitter users mapped to a socioeconomic class label • ~2M tweets upper medium + lower 3-way 2-way
  48. 48. Predicting the socioeconomic class: Features • User Profile (4) • User bio (523) • Text in tweets (560) • Topics - Word Clusters (200) • User Impact on the platform (4) • Total of 1,291 features
  49. 49. Predicting the socioeconomic class: Performance Classification Accuracy (%) Precision (%) Recall (%) F1 2-way 82.05 (2.4) 82.2 (2.4) 81.97 (2.6) .821 (.03) 3-way 75.09 (3.3) 72.04 (4.4) 70.76 (5.7) .714 (.05) Classification performance (10-fold CV)
  50. 50. Predicting the socioeconomic class: Performance T1 T2 P O1 584 115 83.5% O2 126 517 80.4% R 82.3% 81.8% 82.0% T1 T2 T3 P O1 606 84 53 81.6% O2 49 186 45 66.4% O3 55 48 216 67.7% R 854% 58.5% 68.8% 75.1% Confusion matrices (aggregate) O = output (inferred), T = target, P = precision, R = recall {1, 2, 3} = {upper, middle, lower} socioeconomic status
  51. 51. Conclusions • User-generated content is extremely useful resource • infer user demographics • social science research • commercial tasks • User socio-economic status influences language use in social media • Non-linear models (Gaussian Processes) • better modelling of demographic inference tasks • interpretability • Topic features provide better representations and performance • Qualitative analysis • Insights to interesting patterns
  52. 52. Acknowledgements Daniel Preotiuc-Pietro……………………………………..UPenn Vasileios Lampos……………………………………………..UCL Ingemar J. Cox…………………….UCL & Uni. of Copenhagen Jens K. Geyti………………………………………………….UCL Bin Zou……..………………………………………………….UCL Svitlana Volkova……………………………………………..PNNL Yoram Bachrach…………………………….Microsoft Research
  53. 53. Publications D. Preoţiuc-Pietro, V. Lampos and N. Aletras (2015). An Analysis of the User Occupational Class through Twitter Content. In ACL. D. Preoţiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach, N. Aletras (2015). Studying User Income through Language, Behaviour and Affect in Social Media. PLOS ONE. V. Lampos, N. Aletras, J. K. Geyti, B. Zou, I. J. Cox (2016). Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language. In ECIR.
  54. 54. Thank you Questions?

×