Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language

This paper presents a method to classify social media users based on their socioeconomic status. Our experiments are conducted on a curated set of Twitter profiles, where each user is represented by the posted text, topics of discussion, interactive behaviour and estimated impact on the microblogging platform. Initially, we formulate a 3-way classification task, where users are classified as having an upper, middle or lower socioeconomic status. A nonlinear, generative learning approach using a composite Gaussian Process kernel provides significantly better classification accuracy (75%) than a competitive linear alternative. By turning this task into a binary classification (upper vs. middle and lower class), the proposed classifier reaches an accuracy of 82%.

Poster transcript

Inferring the Socioeconomic Status of Social Media Users based on Behaviour & Language
Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou & Ingemar J. Cox
Funded by the EPSRC IRC project "i-sense"
Data set: http://dx.doi.org/10.6084/m9.figshare.1619703

Twitter user attributes (feature categories)
c1. Profile description text (523 1-grams & 2-grams); example descriptions: "football player at Liverpool FC", "tweets from the best barista in London", "estate agent, stamp collector & proud mother"
c2. Behaviour: % re-tweets, % @mentions, % unique @mentions, % @replies
c3. Impact: # followers, # followees, # listed, 'impact' score
c4. Text in posts (560 1-grams)
c5. Topics of discussion: Corporate, Education, Internet Slang, Politics, Shopping, Sports, Vacation, … (200 topics)
Overall, each Twitter user is thus represented by a 1,291-dimensional feature vector (523 + 4 + 4 + 560 + 200). A sketch of the behaviour features (c2) follows.
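As a concrete illustration of the behaviour category (c2), the Python sketch below computes the four proportions from a user's tweet history. The poster only names these features, so the precise definitions (e.g. the denominator used for % unique @mentions) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    is_retweet: bool
    reply_to: Optional[str]   # screen name this tweet replies to, if any
    mentions: List[str]       # screen names @-mentioned in the tweet

def behaviour_features(tweets: List[Tweet]) -> dict:
    """Behaviour features (c2) as percentages over a user's tweets."""
    n = len(tweets)
    all_mentions = [m for t in tweets for m in t.mentions]
    return {
        "pct_retweets": 100.0 * sum(t.is_retweet for t in tweets) / n,
        "pct_mentions": 100.0 * sum(bool(t.mentions) for t in tweets) / n,
        # assumed definition: distinct accounts mentioned per mention made
        "pct_unique_mentions": 100.0 * len(set(all_mentions)) / max(len(all_mentions), 1),
        "pct_replies": 100.0 * sum(t.reply_to is not None for t in tweets) / n,
    }

tweets = [Tweet(True, None, ["bbc"]), Tweet(False, "a_user", ["a_user"]),
          Tweet(False, None, [])]
print(behaviour_features(tweets))
```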
A. How is a user profile mapped to a socioeconomic status?
Profile description on Twitter → Occupation → SOC category (1) → NS-SEC (2)
(1) Standard Occupational Classification (SOC): jobs are categorised based on the required skill level and specialisation [5]. At the top level there are 9 general occupation groups, and the scheme breaks down into sub-categories forming a 4-level structure; the bottom of this hierarchy contains more specific job groupings (369 in total).
(2) National Statistics Socio-Economic Classification (NS-SEC): SOC provides a simplified mapping from these job groupings to a socioeconomic status (SES) as defined by NS-SEC [17], i.e. upper, middle or lower. Using this mapping to label each user account in our data set resulted in 710, 318 and 314 users in the upper, middle and lower SES classes, respectively. A minimal sketch of this two-step lookup follows.
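The sketch below illustrates the SOC → NS-SEC → SES lookup in Python. The job groupings shown and their NS-SEC classes are illustrative examples (the full scheme has 369 groupings), and the three-way collapse of the NS-SEC analytic classes is an assumed reading of the simplified mapping, not the paper's exact table.

```python
# Hypothetical illustration of the two-step labelling pipeline:
# SOC job grouping (from the profile description) -> NS-SEC class -> SES label.
SOC_TO_NSSEC = {
    "2211 Medical practitioners": 1,   # higher professional occupations
    "3544 Estate agents": 3,           # intermediate occupations
    "9273 Waiters and waitresses": 7,  # routine occupations
}

# Assumed collapse of the NS-SEC analytic classes to three SES labels.
NSSEC_TO_SES = {1: "upper", 2: "upper",
                3: "middle", 4: "middle",
                5: "lower", 6: "lower", 7: "lower"}

def ses_label(soc_grouping: str) -> str:
    """Map a SOC job grouping to an {upper, middle, lower} SES label."""
    return NSSEC_TO_SES[SOC_TO_NSSEC[soc_grouping]]

print(ses_label("3544 Estate agents"))  # -> middle
```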
B. Data sets
T1: 1,342 Twitter user profiles and 2 million of their tweets, posted between February 1, 2014 and March 21, 2015; each profile is labelled with a socioeconomic status. The labelled data set is available at http://dx.doi.org/10.6084/m9.figshare.1619703.
T2: 160 million tweets, a sample of UK Twitter covering the same date range as T1, used to learn a set of 200 latent topics.

C. Formulating a Gaussian Process classifier
We use a composite Gaussian Process (GP) as our main method for performing classification. GPs can be defined as sets of random variables, any finite number of which have a multivariate Gaussian distribution [16]. Formally, GP methods aim to learn a function $f\colon \mathbb{R}^d \to \mathbb{R}$ drawn from a GP prior given the inputs $\mathbf{x} \in \mathbb{R}^d$:

$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big) \quad (1)$

where $m(\cdot)$ is the mean function (here set equal to 0) and $k(\cdot, \cdot)$ is the covariance kernel. We apply the squared exponential (SE) kernel, also known as the radial basis function (RBF), defined as

$k_{\mathrm{SE}}(\mathbf{x}, \mathbf{x}') = \theta^2 \exp\big(-\|\mathbf{x} - \mathbf{x}'\|_2^2 \,/\, (2\ell^2)\big)$

where $\theta^2$ is a constant that describes the overall level of variance and $\ell$ is referred to as the characteristic length-scale parameter. Note that $\ell$ is inversely proportional to the predictive relevancy of $\mathbf{x}$ (high values indicate a low degree of relevance).

Binary classification using GPs 'squashes' the real-valued latent function output $f(\mathbf{x})$ through a logistic function, $\pi(\mathbf{x}) \triangleq P(y = 1 \mid \mathbf{x}) = \sigma(f(\mathbf{x}))$, in a similar way to logistic regression classification. The distribution over the latent $f_*$ is then combined with the logistic function to produce the prediction $\bar{\pi}_* = \int \sigma(f_*)\, P(f_* \mid \mathbf{x}, y, \mathbf{x}_*)\, \mathrm{d}f_*$. The posterior formulation has a non-Gaussian likelihood and thus the model parameters can only be estimated; for this purpose we use the Laplace approximation [16,18].

Based on the property that the sum of covariance functions is also a valid covariance function [16], we model each user feature category with a different SE kernel. The final covariance function therefore becomes

$k(\mathbf{x}, \mathbf{x}') = \left( \sum_{n=1}^{C} k_{\mathrm{SE}}(\mathbf{c}_n, \mathbf{c}'_n) \right) + k_N(\mathbf{x}, \mathbf{x}') \quad (2)$

where $\mathbf{c}_n$ expresses the features of category $n$, i.e. $\mathbf{x} = \{\mathbf{c}_1, \ldots, \mathbf{c}_C\}$, $C$ is the number of feature categories (in our experimental setup, $C = 5$) and $k_N(\mathbf{x}, \mathbf{x}') = \theta_N^2\, \delta(\mathbf{x}, \mathbf{x}')$ models noise ($\delta$ being a Kronecker delta function). Similar GP kernel formulations have been applied to text regression tasks [7,9,11] as a way of capturing groupings of the feature space more effectively.
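The composite kernel of Eq. (2) can be reproduced with an off-the-shelf GP library. The sketch below uses GPy (not the authors' code); the column ranges assigned to each feature category and the placeholder data are assumptions made for illustration.

```python
import numpy as np
import GPy

# Assumed column layout of the 1,291-dim user vector (illustrative only):
# c1 profile description (523), c2 behaviour (4), c3 impact (4),
# c4 text in posts (560), c5 topic frequencies (200).
categories = {"c1_profile": range(0, 523), "c2_behaviour": range(523, 527),
              "c3_impact": range(527, 531), "c4_text": range(531, 1091),
              "c5_topics": range(1091, 1291)}

X = np.random.randn(200, 1291)          # placeholder user feature vectors
y = np.random.randint(0, 2, (200, 1))   # placeholder binary SES labels

# Eq. (2): one SE (RBF) kernel per feature category, summed, plus white noise.
kernel = GPy.kern.White(input_dim=1291)
for name, cols in categories.items():
    kernel = kernel + GPy.kern.RBF(input_dim=len(cols),
                                   active_dims=list(cols), name=name)

# Binary GP classification: Bernoulli likelihood with the Laplace
# approximation, mirroring the paper's setup.
model = GPy.core.GP(X, y, kernel=kernel,
                    likelihood=GPy.likelihoods.Bernoulli(),
                    inference_method=GPy.inference.latent_function_inference.Laplace())
model.optimize()             # type-II ML for the variances and length-scales
probs, _ = model.predict(X)  # predictive pi(x) = P(y = 1 | x)
```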
Although related work has indicated the superiority of nonlinear approaches in similar multimodal tasks [7,14], we also estimate a performance baseline using a linear method. Given the high dimensionality of our task, we apply logistic regression with elastic net regularisation [6] for this purpose. As both classification techniques address binary tasks, we adopt the one-vs.-all strategy for inference in the 3-way setting.
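A minimal scikit-learn sketch of this baseline is shown below, assuming hypothetical placeholder data in place of the real feature matrix and labels; the l1_ratio and C values are illustrative, not the paper's tuned settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.randn(300, 1291)         # placeholder user feature matrix
y_3way = np.random.randint(0, 3, 300)  # placeholder SES labels {0, 1, 2}

# Elastic-net-regularised logistic regression (saga supports this penalty).
base = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)

# One-vs.-all decomposition of the 3-way {upper, middle, lower} task.
clf = OneVsRestClassifier(base)
scores = cross_val_score(clf, X, y_3way, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```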
D. Classification performance (10-fold cross-validation)

Table 2. SES classification mean performance, estimated via a 10-fold cross-validation of the composite GP classifier, for both problem specifications. Parentheses hold the standard deviation of the mean estimate.

Task    Accuracy (%)   Precision (%)   Recall (%)     F-score
3-way   75.09 (3.28)   72.04 (4.40)    70.76 (5.65)   .714 (.049)
2-way   82.05 (2.41)   82.20 (2.39)    81.97 (2.55)   .821 (.025)

Confusion matrices (aggregate over folds), where O = output (inferred) class, T = target class, P = precision, R = recall and {1, 2, 3} = {upper, middle, lower} socioeconomic status.

2-way task:
        T1      T2      P
O1      584     115     83.5%
O2      126     517     80.4%
R       82.3%   81.8%   82.0%

3-way task:
        T1      T2      T3      P
O1      606     84      53      81.6%
O2      49      186     45      66.4%
O3      55      48      216     67.7%
R       85.4%   58.5%   68.8%   75.1%
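As a sanity check, the per-class precision and recall of the 3-way task can be recomputed directly from the aggregate counts above; the small numpy sketch below uses only the values printed in panel D.

```python
import numpy as np

# Aggregate 3-way confusion matrix (rows = inferred class O,
# columns = target class T; classes 1-3 = upper, middle, lower).
cm = np.array([[606,  84,  53],
               [ 49, 186,  45],
               [ 55,  48, 216]])

precision = cm.diagonal() / cm.sum(axis=1)  # per row (inferred class)
recall    = cm.diagonal() / cm.sum(axis=0)  # per column (target class)
accuracy  = cm.diagonal().sum() / cm.sum()

print(np.round(100 * precision, 1))  # [81.6 66.4 67.7]
print(np.round(100 * recall, 1))     # [85.4 58.5 68.8]
print(round(100 * accuracy, 1))      # 75.1
```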
E. Topics (word clusters)
Topics are formed by applying spectral clustering [12] on daily word frequencies in T2, deriving 200 hard clusters of 1-grams that capture latent topics and linguistic expressions (e.g. 'Politics', 'Sports', 'Internet Slang'). Previous research has shown that this number of clusters is adequate for achieving a strong performance in similar tasks [7,13,14]. We then computed the frequency of each topic in the tweets of T1, as described in feature category c5. A reproduction sketch follows Table 1.

Table 1. 1-gram samples from a subset of the 200 latent topics (word clusters) extracted automatically from Twitter data (T2).

Topic            Sample of 1-grams
Corporate        #business, clients, development, marketing, offices, product
Education        assignments, coursework, dissertation, essay, library, notes, studies
Family           #family, auntie, dad, family, mother, nephew, sister, uncle
Internet Slang   ahahaha, awwww, hahaa, hahahaha, hmmmm, loooool, oooo, yay
Politics         #labour, #politics, #tories, conservatives, democracy, voters
Shopping         #shopping, asda, bargain, customers, market, retail, shops, toys
Sports           #football, #winner, ball, bench, defending, footballer, goal, won
Summertime       #beach, #sea, #summer, #sunshine, bbq, hot, seaside, swimming
Terrorism        #jesuischarlie, cartoon, freedom, religion, shootings, terrorism
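The clustering step can be reproduced, in spirit, with scikit-learn. The poster specifies only "spectral clustering on daily word frequencies", so the placeholder inputs and the cosine-similarity affinity below are assumptions about the preprocessing.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder inputs: a vocabulary of 1-grams and their daily frequency
# profiles over the T2 date range (rows = words, columns = days).
rng = np.random.default_rng(0)
vocabulary = np.array([f"word{i}" for i in range(1000)])
word_day_freq = rng.random((1000, 414))  # ~414 days from 2014-02-01 to 2015-03-21

# Cosine similarity between the words' daily-frequency time series is used
# as the affinity matrix (an assumed choice of similarity).
normed = word_day_freq / np.linalg.norm(word_day_freq, axis=1, keepdims=True)
affinity = np.clip(normed @ normed.T, 0.0, 1.0)

# 200 hard clusters, matching the number of latent topics in the paper.
labels = SpectralClustering(n_clusters=200, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Each cluster is a 'topic'; a user's c5 features are the frequencies
# of these topics in their tweets.
print(vocabulary[labels == labels[0]][:8])
```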
F. Conclusions
(a) We present a first approach for inferring the socioeconomic status of a social media user; (b) the proposed classifier reaches 75% and 82% accuracy in the 3-way and binary classification tasks, respectively; (c) future work is required to evaluate this framework more rigorously and to analyse its underlying qualitative properties in detail.

Selected references
Lampos et al. Predicting and Characterising User Impact on Twitter. EACL, 2014.
Preotiuc-Pietro et al. An Analysis of the User Occupational Class through Twitter Content. ACL, 2015.
Preotiuc-Pietro et al. Studying User Income through Language, Behaviour and Affect in Social Media. PLoS ONE, 2015.
Rasmussen and Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
