Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language

This paper presents a method to classify social media users based on their socioeconomic status. Our experiments are conducted on a curated set of Twitter profiles, where each user is represented by the posted text, topics of discussion, interactive behaviour and estimated impact on the microblogging platform. Initially, we formulate a 3-way classification task, where users are classified as having an upper, middle or lower socioeconomic status. A nonlinear, generative learning approach using a composite Gaussian Process kernel provides significantly better classification accuracy (75%) than a competitive linear alternative. By turning this task into a binary classification (upper vs. middle and lower class), the proposed classifier reaches an accuracy of 82%.

Poster transcript

Inferring the Socioeconomic Status of Social Media Users based on Behaviour & Language
Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou & Ingemar J. Cox
Funded by the EPSRC IRC project "i-sense"
Data set: http://dx.doi.org/10.6084/m9.figshare.1619703

Twitter user attributes (feature categories)
c1. Profile description text (523 1-grams & 2-grams); example descriptions: "football player at Liverpool FC", "tweets from the best barista in London", "estate agent, stamp collector & proud mother"
c2. Behaviour: % re-tweets, % @mentions, % unique @mentions, % @replies
c3. Impact: # followers, # followees, # listed, 'impact' score
c4. Text in posts (560 1-grams)
c5. Topics of discussion: Corporate, Education, Internet Slang, Politics, Shopping, Sports, Vacation, … (200 topics)
Overall, each Twitter user is thus represented by a 1,291-dimensional feature vector (523 + 4 + 4 + 560 + 200). A sketch of the behaviour features (c2) follows.
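As a concrete illustration of the behaviour category (c2), the Python sketch below computes the four proportions from a user's tweet history. The poster only names these features, so the precise definitions (e.g. the denominator used for % unique @mentions) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    is_retweet: bool
    reply_to: Optional[str]   # screen name this tweet replies to, if any
    mentions: List[str]       # screen names @-mentioned in the tweet

def behaviour_features(tweets: List[Tweet]) -> dict:
    """Behaviour features (c2) as percentages over a user's tweets."""
    n = len(tweets)
    all_mentions = [m for t in tweets for m in t.mentions]
    return {
        "pct_retweets": 100.0 * sum(t.is_retweet for t in tweets) / n,
        "pct_mentions": 100.0 * sum(bool(t.mentions) for t in tweets) / n,
        # assumed definition: distinct accounts mentioned per mention made
        "pct_unique_mentions": 100.0 * len(set(all_mentions)) / max(len(all_mentions), 1),
        "pct_replies": 100.0 * sum(t.reply_to is not None for t in tweets) / n,
    }

tweets = [Tweet(True, None, ["bbc"]), Tweet(False, "a_user", ["a_user"]),
          Tweet(False, None, [])]
print(behaviour_features(tweets))
```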
A. How is a user profile mapped to a socioeconomic status?
Profile description on Twitter → Occupation → SOC category (1) → NS-SEC (2)
(1) Standard Occupational Classification (SOC): jobs are categorised based on the required skill level and specialisation [5]. At the top level there are 9 general occupation groups, and the scheme breaks down into sub-categories forming a 4-level structure; the bottom of this hierarchy contains more specific job groupings (369 in total).
(2) National Statistics Socio-Economic Classification (NS-SEC): SOC provides a simplified mapping from these job groupings to a socioeconomic status (SES) as defined by NS-SEC [17], i.e. upper, middle or lower. Using this mapping to label each user account in our data set resulted in 710, 318 and 314 users in the upper, middle and lower SES classes, respectively. A minimal sketch of this two-step lookup follows.
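The sketch below illustrates the SOC → NS-SEC → SES lookup in Python. The job groupings shown and their NS-SEC classes are illustrative examples (the full scheme has 369 groupings), and the three-way collapse of the NS-SEC analytic classes is an assumed reading of the simplified mapping, not the paper's exact table.

```python
# Hypothetical illustration of the two-step labelling pipeline:
# SOC job grouping (from the profile description) -> NS-SEC class -> SES label.
SOC_TO_NSSEC = {
    "2211 Medical practitioners": 1,   # higher professional occupations
    "3544 Estate agents": 3,           # intermediate occupations
    "9273 Waiters and waitresses": 7,  # routine occupations
}

# Assumed collapse of the NS-SEC analytic classes to three SES labels.
NSSEC_TO_SES = {1: "upper", 2: "upper",
                3: "middle", 4: "middle",
                5: "lower", 6: "lower", 7: "lower"}

def ses_label(soc_grouping: str) -> str:
    """Map a SOC job grouping to an {upper, middle, lower} SES label."""
    return NSSEC_TO_SES[SOC_TO_NSSEC[soc_grouping]]

print(ses_label("3544 Estate agents"))  # -> middle
```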
B. Data sets
T1: 1,342 Twitter user profiles and 2 million of their tweets, posted between February 1, 2014 and March 21, 2015; each profile is labelled with a socioeconomic status. The labelled data set is available at http://dx.doi.org/10.6084/m9.figshare.1619703.
T2: 160 million tweets, a sample of UK Twitter covering the same date range as T1, used to learn a set of 200 latent topics.

C. Formulating a Gaussian Process classifier
We use a composite Gaussian Process (GP) as our main method for performing classification. GPs can be defined as sets of random variables, any finite number of which have a multivariate Gaussian distribution [16]. Formally, GP methods aim to learn a function $f\colon \mathbb{R}^d \to \mathbb{R}$ drawn from a GP prior given the inputs $\mathbf{x} \in \mathbb{R}^d$:

$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big) \quad (1)$

where $m(\cdot)$ is the mean function (here set equal to 0) and $k(\cdot, \cdot)$ is the covariance kernel. We apply the squared exponential (SE) kernel, also known as the radial basis function (RBF), defined as

$k_{\mathrm{SE}}(\mathbf{x}, \mathbf{x}') = \theta^2 \exp\big(-\|\mathbf{x} - \mathbf{x}'\|_2^2 \,/\, (2\ell^2)\big)$

where $\theta^2$ is a constant that describes the overall level of variance and $\ell$ is referred to as the characteristic length-scale parameter. Note that $\ell$ is inversely proportional to the predictive relevancy of $\mathbf{x}$ (high values indicate a low degree of relevance).

Binary classification using GPs 'squashes' the real-valued latent function output $f(\mathbf{x})$ through a logistic function, $\pi(\mathbf{x}) \triangleq P(y = 1 \mid \mathbf{x}) = \sigma(f(\mathbf{x}))$, in a similar way to logistic regression classification. The distribution over the latent $f_*$ is then combined with the logistic function to produce the prediction $\bar{\pi}_* = \int \sigma(f_*)\, P(f_* \mid \mathbf{x}, y, \mathbf{x}_*)\, \mathrm{d}f_*$. The posterior formulation has a non-Gaussian likelihood and thus the model parameters can only be estimated; for this purpose we use the Laplace approximation [16,18].

Based on the property that the sum of covariance functions is also a valid covariance function [16], we model each user feature category with a different SE kernel. The final covariance function therefore becomes

$k(\mathbf{x}, \mathbf{x}') = \left( \sum_{n=1}^{C} k_{\mathrm{SE}}(\mathbf{c}_n, \mathbf{c}'_n) \right) + k_N(\mathbf{x}, \mathbf{x}') \quad (2)$

where $\mathbf{c}_n$ expresses the features of category $n$, i.e. $\mathbf{x} = \{\mathbf{c}_1, \ldots, \mathbf{c}_C\}$, $C$ is the number of feature categories (in our experimental setup, $C = 5$) and $k_N(\mathbf{x}, \mathbf{x}') = \theta_N^2\, \delta(\mathbf{x}, \mathbf{x}')$ models noise ($\delta$ being a Kronecker delta function). Similar GP kernel formulations have been applied to text regression tasks [7,9,11] as a way of capturing groupings of the feature space more effectively.
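The composite kernel of Eq. (2) can be reproduced with an off-the-shelf GP library. The sketch below uses GPy (not the authors' code); the column ranges assigned to each feature category and the placeholder data are assumptions made for illustration.

```python
import numpy as np
import GPy

# Assumed column layout of the 1,291-dim user vector (illustrative only):
# c1 profile description (523), c2 behaviour (4), c3 impact (4),
# c4 text in posts (560), c5 topic frequencies (200).
categories = {"c1_profile": range(0, 523), "c2_behaviour": range(523, 527),
              "c3_impact": range(527, 531), "c4_text": range(531, 1091),
              "c5_topics": range(1091, 1291)}

X = np.random.randn(200, 1291)          # placeholder user feature vectors
y = np.random.randint(0, 2, (200, 1))   # placeholder binary SES labels

# Eq. (2): one SE (RBF) kernel per feature category, summed, plus white noise.
kernel = GPy.kern.White(input_dim=1291)
for name, cols in categories.items():
    kernel = kernel + GPy.kern.RBF(input_dim=len(cols),
                                   active_dims=list(cols), name=name)

# Binary GP classification: Bernoulli likelihood with the Laplace
# approximation, mirroring the paper's setup.
model = GPy.core.GP(X, y, kernel=kernel,
                    likelihood=GPy.likelihoods.Bernoulli(),
                    inference_method=GPy.inference.latent_function_inference.Laplace())
model.optimize()             # type-II ML for the variances and length-scales
probs, _ = model.predict(X)  # predictive pi(x) = P(y = 1 | x)
```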
Although related work has indicated the superiority of nonlinear approaches in similar multimodal tasks [7,14], we also estimate a performance baseline using a linear method. Given the high dimensionality of our task, we apply logistic regression with elastic net regularisation [6] for this purpose. As both classification techniques address binary tasks, we adopt the one-vs.-all strategy for inference in the 3-way setting.
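A minimal scikit-learn sketch of this baseline is shown below, assuming hypothetical placeholder data in place of the real feature matrix and labels; the l1_ratio and C values are illustrative, not the paper's tuned settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.randn(300, 1291)         # placeholder user feature matrix
y_3way = np.random.randint(0, 3, 300)  # placeholder SES labels {0, 1, 2}

# Elastic-net-regularised logistic regression (saga supports this penalty).
base = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)

# One-vs.-all decomposition of the 3-way {upper, middle, lower} task.
clf = OneVsRestClassifier(base)
scores = cross_val_score(clf, X, y_3way, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```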
D. Classification performance (10-fold cross-validation)

Table 2. SES classification mean performance, estimated via a 10-fold cross-validation of the composite GP classifier, for both problem specifications. Parentheses hold the standard deviation of the mean estimate.

Task    Accuracy (%)   Precision (%)   Recall (%)     F-score
3-way   75.09 (3.28)   72.04 (4.40)    70.76 (5.65)   .714 (.049)
2-way   82.05 (2.41)   82.20 (2.39)    81.97 (2.55)   .821 (.025)

Confusion matrices (aggregate over folds), where O = output (inferred) class, T = target class, P = precision, R = recall and {1, 2, 3} = {upper, middle, lower} socioeconomic status.

2-way task:
        T1      T2      P
O1      584     115     83.5%
O2      126     517     80.4%
R       82.3%   81.8%   82.0%

3-way task:
        T1      T2      T3      P
O1      606     84      53      81.6%
O2      49      186     45      66.4%
O3      55      48      216     67.7%
R       85.4%   58.5%   68.8%   75.1%
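As a sanity check, the per-class precision and recall of the 3-way task can be recomputed directly from the aggregate counts above; the small numpy sketch below uses only the values printed in panel D.

```python
import numpy as np

# Aggregate 3-way confusion matrix (rows = inferred class O,
# columns = target class T; classes 1-3 = upper, middle, lower).
cm = np.array([[606,  84,  53],
               [ 49, 186,  45],
               [ 55,  48, 216]])

precision = cm.diagonal() / cm.sum(axis=1)  # per row (inferred class)
recall    = cm.diagonal() / cm.sum(axis=0)  # per column (target class)
accuracy  = cm.diagonal().sum() / cm.sum()

print(np.round(100 * precision, 1))  # [81.6 66.4 67.7]
print(np.round(100 * recall, 1))     # [85.4 58.5 68.8]
print(round(100 * accuracy, 1))      # 75.1
```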
E. Topics (word clusters)
Topics are formed by applying spectral clustering [12] on daily word frequencies in T2, deriving 200 hard clusters of 1-grams that capture latent topics and linguistic expressions (e.g. 'Politics', 'Sports', 'Internet Slang'). Previous research has shown that this number of clusters is adequate for achieving a strong performance in similar tasks [7,13,14]. We then computed the frequency of each topic in the tweets of T1, as described in feature category c5. A reproduction sketch follows Table 1.

Table 1. 1-gram samples from a subset of the 200 latent topics (word clusters) extracted automatically from Twitter data (T2).

Topic            Sample of 1-grams
Corporate        #business, clients, development, marketing, offices, product
Education        assignments, coursework, dissertation, essay, library, notes, studies
Family           #family, auntie, dad, family, mother, nephew, sister, uncle
Internet Slang   ahahaha, awwww, hahaa, hahahaha, hmmmm, loooool, oooo, yay
Politics         #labour, #politics, #tories, conservatives, democracy, voters
Shopping         #shopping, asda, bargain, customers, market, retail, shops, toys
Sports           #football, #winner, ball, bench, defending, footballer, goal, won
Summertime       #beach, #sea, #summer, #sunshine, bbq, hot, seaside, swimming
Terrorism        #jesuischarlie, cartoon, freedom, religion, shootings, terrorism
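The clustering step can be reproduced, in spirit, with scikit-learn. The poster specifies only "spectral clustering on daily word frequencies", so the placeholder inputs and the cosine-similarity affinity below are assumptions about the preprocessing.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder inputs: a vocabulary of 1-grams and their daily frequency
# profiles over the T2 date range (rows = words, columns = days).
rng = np.random.default_rng(0)
vocabulary = np.array([f"word{i}" for i in range(1000)])
word_day_freq = rng.random((1000, 414))  # ~414 days from 2014-02-01 to 2015-03-21

# Cosine similarity between the words' daily-frequency time series is used
# as the affinity matrix (an assumed choice of similarity).
normed = word_day_freq / np.linalg.norm(word_day_freq, axis=1, keepdims=True)
affinity = np.clip(normed @ normed.T, 0.0, 1.0)

# 200 hard clusters, matching the number of latent topics in the paper.
labels = SpectralClustering(n_clusters=200, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Each cluster is a 'topic'; a user's c5 features are the frequencies
# of these topics in their tweets.
print(vocabulary[labels == labels[0]][:8])
```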
F. Conclusions
(a) We present a first approach for inferring the socioeconomic status of a social media user; (b) the proposed classifier reaches 75% and 82% accuracy in the 3-way and binary classification tasks, respectively; (c) future work is required to evaluate this framework more rigorously and to analyse its underlying qualitative properties in detail.

Selected references
Lampos et al. Predicting and Characterising User Impact on Twitter. EACL, 2014.
Preotiuc-Pietro et al. An Analysis of the User Occupational Class through Twitter Content. ACL, 2015.
Preotiuc-Pietro et al. Studying User Income through Language, Behaviour and Affect in Social Media. PLoS ONE, 2015.
Rasmussen and Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
