Extracting information from ‘messy’ social media data

Piet Daas, Marco Puts, Ali Hürriyetoglu

Using Big Data for official statistics
– Can we use Big Data to produce official statistics and, if so, how?
– Statistics Netherlands produces reliable and consistent
statistical information
‐ The official statistics of the country
– These figures are based on target populations
‐ E.g. the country, its inhabitants and its companies
– We want to use as much data as is (freely) available
‐ Fewer questionnaires, more administrative and Big Data
Combining these is challenging!

It is important to know that
– Statistics Netherlands is the first organization to have
produced an official statistic based on Big Data
‐ Traffic intensity statistics based on road sensor data
– Statistics Netherlands is the leading organisation in the
official statistical world regarding the use of Big Data
– It has recently created a ‘Center for Big Data Statistics’
‐ With many partners involved (> 30)

Pros and cons of using Big Data
– Positive (2 of the 3 V’s)
‐ A lot of data (volume)
‐ Readily available (velocity)
– Negative
‐ Variety (not that stable)
‐ Potentially biased (selective part of population)
‐ Mostly event-based (e.g. message-oriented, not user-oriented)
‐ Little information is available on the users
‐ A challenging data source for producing high-quality
statistics!

Big Data studies on Social media
– Statistics oriented
‐ Social media sentiment and Consumer Confidence
‐ Social media based (un)safety monitor
– Population oriented
‐ Users (People, Companies and Others)
‐ Determining background characteristics
‐ We use twiqs.nl, Coosto and Twitter API

Social media in the Netherlands
Map by Eric Fischer (via Fast Company)

Social media sentiment
– Studied public Dutch social media collected by Coosto
‐ Not only Twitter, but also Facebook, etc.
‐ Looked at the sentiment (positive, negative or neutral) in these messages
‐ Studied the change in overall sentiment over time
‐ Around 3-4 million messages per day
‐ Overall sentiment = (pos. messages – neg. messages) / total (%)
‐ Aggregated per day/week/month
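The overall sentiment formula above can be sketched in a few lines of Python. This is only an illustration, not the Coosto or Statistics Netherlands implementation: the label strings "pos", "neg" and "neutral" and the function name are assumptions.

```python
from collections import Counter


def overall_sentiment(labels):
    """Return (positive - negative) / total, expressed as a percentage.

    `labels` holds one sentiment label per message; the label strings
    "pos", "neg" and "neutral" are assumptions made for this sketch.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no messages in this period")
    return 100.0 * (counts["pos"] - counts["neg"]) / total


# Toy input: 2 positive, 1 negative and 1 neutral message.
example = overall_sentiment(["pos", "pos", "neg", "neutral"])
```

Applying the same function to the messages of a day, a week or a month gives the series at each level of aggregation.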

Daily, weekly, monthly sentiment

Sentiment per platform
(Figure: sentiment per platform; Facebook ~10% and Twitter ~80% of all messages)

Table 1. Social media message properties for various platforms and their correlation with consumer confidence

Social media platform     Number of social     Messages as percentage    Correlation coefficient (r) of monthly
                          media messages¹      of total (%)              sentiment index and consumer confidence²
All platforms combined    3,153,002,327        100                        0.75     0.78
Facebook                    334,854,088         10.6                      0.81*    0.85*
Twitter                   2,526,481,479         80.1                      0.68     0.70
Hyves                        45,182,025          1.4                      0.50     0.58
News sites                   56,027,686          1.8                      0.37     0.26
Blogs                        48,600,987          1.5                      0.25     0.22
Google+                         644,039          0.02                    -0.04    -0.09
Linkedin                        565,811          0.02                    -0.23    -0.25
Youtube                       5,661,274          0.2                     -0.37    -0.41
Forums                      134,984,938          4.3                     -0.45    -0.49

¹ Period covered: June 2010 until November 2013
² Confirmed by visually inspecting scatterplots and additional checks (see text)
* Cointegrated
Platform-specific results
Combination of Facebook and Twitter is best (r > 0.9)
(association continues after that period)

Overall findings
– Correlation and cointegration
‐ The consumer confidence survey is conducted during the first 2 weeks of a month
‐ Comparing various periods revealed that the best correlation and cointegration
are with the last 2 weeks of the previous month and the first 2 weeks of the current month
• Highest correlation 0.93* (all Facebook * filtered Twitter)
– Granger causality
‐ Changes in consumer confidence precede changes in social media
sentiment
‐ For all combinations shown!
• However: social media data are available to us sooner!
– Prediction
‐ Slightly better than random chance
‐ Best for the 4th ‘week’ of month
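A correlation of this kind can be computed as follows, assuming that r denotes the Pearson coefficient (the slides do not say so explicitly). The function name is hypothetical and the sketch covers the correlation only; it does not test for cointegration or Granger causality.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(ss_x * ss_y)
```

To compare periods as described above, the monthly sentiment series would be rebuilt for each candidate window (for example the last 2 weeks of the previous month plus the first 2 weeks of the current month) and correlated with the consumer confidence series each time.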

(Un)safety feeling in social media
– Interviewed people and created a list of words associated
with feelings of (un)safety (347)
– Checked if these words are used in social media (81)
– Only included the most frequently used words (24)
– First version of indicator
‐ Need to: Check context of messages included
‐ Need to: Compare height of peaks with ‘severity’ of
event
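A word-list indicator of this kind can be sketched as a count of messages per day that contain at least one listed word. The words used in the usage example are placeholders, not the 24 words selected in the study, and the function name and the (day, text) input format are assumptions.

```python
import re
from collections import Counter


def daily_keyword_counts(messages, keywords):
    """Count, per day, the messages that contain at least one keyword.

    `messages` is an iterable of (day, text) pairs; matching is
    case-insensitive and restricted to whole words.
    """
    alternatives = "|".join(re.escape(word) for word in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    counts = Counter()
    for day, text in messages:
        if pattern.search(text):
            counts[day] += 1
    return counts


# Placeholder data, for illustration only.
toy_messages = [
    ("2016-03-22", "I feel unsafe at the airport"),
    ("2016-03-22", "Nice weather today"),
    ("2016-03-23", "Still AFRAID to travel"),
]
toy_counts = daily_keyword_counts(toy_messages, ["unsafe", "afraid"])
```

A plain count like this ignores context, which is why checking the context of the matched messages is listed as a next step.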

Unsafety monitor (first version)
Peaks labelled in the figure:
– Bomb attack at Brussels airport, 22-03-2016
– Truck attack, Nice, 14-07-2016
– Terrorist attacks, Paris, 14-11-2015
– Intruder at NOS, 29-01-2015
– Charlie Hebdo, Paris, 09-01-2015
– MH17 day of national mourning, 23-07-2014
– Spain-Netherlands football match (1-5), 13-06-2014

(Un)safety feeling in social media (2)
– Interviewed people and created a list of words associated
with feelings of (un)safety (347)
– Checked if these words are used in social media (81)
– Only included the frequently used words (24)
– First version of indicator
‐ Need to: Check context of messages included
‐ Want to: Compare height of peaks with other data

Population studies
– Looked at composition of the units active on Twitter
– Type of units
‐ People, companies/organizations, and others
– Tried to determine background characteristics
‐ Not many units provide such information directly
‐ E.g. gender, age, income, level of education etc.

Starting point
– Drew a sample of 1,000 user IDs from Twitter
‐ Had a list of 330,000 from a previous study
– It was found that:
‐ 844 still existed
• 691 are persons (82%)
• 119 are companies/organizations (14%)
• 34 are ‘others’ (4%)
– Tried to determine gender

1) Name
2) Short bio
3) Message content
4) Picture

Gender findings: 1) First name
• Used Dutch ‘Voornamenbank’ website (First name database)
• Score between 0 (female) and 1 (male); 676 of 844 (80%) names were registered
• Unknown names scored -1 (usually companies/organizations)
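Turning such a first-name score into a label could look like the sketch below. The score convention (0 = female, 1 = male, -1 = unknown) comes from the slide; the `margin` cut-off and the function name are assumptions, and how the scores are retrieved from the Voornamenbank is not shown.

```python
def gender_from_name_score(score, margin=0.1):
    """Map a first-name score to "female", "male" or None (undecided).

    Scores near 0 are read as female and scores near 1 as male;
    -1 marks a name that is not in the database. The `margin`
    threshold is an assumption made for this sketch.
    """
    if score < 0:
        return None  # unknown name, often a company or organization
    if score <= margin:
        return "female"
    if score >= 1 - margin:
        return "male"
    return None  # ambiguous name: leave it to another method
```

Returning None for ambiguous and unknown names leaves those units to the other sources of information.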

Gender findings: 2) Short bio
– If a short bio is provided
– Quite a number of people mention their ‘position’ in the family
‐ Mother, father, papa, mama, ‘son of’, etc.
– Need to check both English and Dutch texts
– 155 of 583 (27%) indicated their gender in short bio
‐ Very precise for women!!
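The short-bio check can be sketched as a whole-word search for family terms. The English terms come from the slide; the Dutch terms and "daughter of" are added here as assumptions, and the real word lists were presumably longer.

```python
import re

# Family terms: the English ones are taken from the slide; the Dutch ones
# and "daughter of" are assumptions added for this sketch.
FEMALE_TERMS = ("mother", "mama", "moeder", "daughter of", "dochter van")
MALE_TERMS = ("father", "papa", "vader", "son of", "zoon van")


def _mentions(bio, terms):
    """True if any term occurs in the bio as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", bio, re.IGNORECASE)
        for term in terms
    )


def gender_from_bio(bio):
    """Return "female", "male" or None when the bio gives no clear signal."""
    female = _mentions(bio, FEMALE_TERMS)
    male = _mentions(bio, MALE_TERMS)
    if female == male:
        return None  # no terms found, or terms for both genders
    return "female" if female else "male"
```

A bio that matches terms for both genders (for example "mother of a son of ...") is left undecided rather than guessed.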

Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that checks gender specific writing style
‐ Language specific: Messages need to be Dutch!
‐ 437 of 473 (92%) persons that created tweets could be classified

Gender findings: 4) Profile picture
– Use OpenCV to process pictures
– 1) Face recognition
– 2) Standardisation of faces (resize & rotate)
– 3) Classify faces according to gender
‐ 603 of 804 (75%) profile pictures contained 1 or more faces

Gender findings: overall results (1)
Diagnostic Odds Ratio (DOR) = (TP/FN) / (FP/TN)
Random guessing: log(DOR) = 0

Method             log(DOR)
First name         4.33
Short bio          2.70
Tweet content      1.96
Picture (faces)    0.57

‐ Multi-agent findings
• Need ‘clever’ ways to combine these
• Take processing efficiency of the ‘agent’ into consideration
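The diagnostic odds ratio defined above can be computed from the four cells of a confusion matrix. The sketch assumes the natural logarithm, which the slides do not specify, and the function name is hypothetical.

```python
import math


def log_dor(tp, fn, fp, tn):
    """Log of the diagnostic odds ratio, (TP/FN) / (FP/TN).

    Uses the natural logarithm (an assumption). A result of 0 means
    the classifier does no better than random guessing.
    """
    if min(tp, fn, fp, tn) <= 0:
        raise ValueError("all four cells must be positive")
    return math.log((tp / fn) / (fp / tn))
```

A classifier with equal counts in all four cells gives log(DOR) = 0, matching the random-guessing baseline on the slide.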

Gender findings: overall results (2)
Combine results in the best possible way
Unassigned (%)    Approach used
844 (100%)        1. Use short bio scores (very precise for females)
689 (82%)         2. Use first name scores
153 (18%)         3. Use Tweet content
29 (3.4%)         4. Use picture
20 (2.4%)         5. Assign male gender
Final log(DOR) is 7.02, an accuracy of 96.5%!
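The stepwise combination above amounts to a cascade: each method is tried in order and a unit moves on to the next method only if it is still unassigned. A minimal sketch follows, in which the function name and the classifier interface (a callable returning a label or None) are assumptions.

```python
def assign_gender(unit, classifiers, default="male"):
    """Return (label, method) from the first classifier that decides.

    `classifiers` is an ordered list of (name, function) pairs; each
    function returns "female", "male" or None when it cannot decide.
    Units that stay unassigned receive `default`, as in step 5.
    """
    for name, classify in classifiers:
        label = classify(unit)
        if label is not None:
            return label, name
    return default, "default"


# Hypothetical stand-ins for the short bio and first name methods.
toy_classifiers = [
    ("short bio", lambda unit: unit.get("bio_label")),
    ("first name", lambda unit: unit.get("name_label")),
]
decided = assign_gender({"name_label": "female"}, toy_classifiers)
fallback = assign_gender({}, toy_classifiers)
```

Putting the most precise method first means a less precise method is only consulted when the better ones give no answer.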

Conclusions and future studies
– Social media is one of the most challenging data sources
for official statistics
– Using it requires that we:
‐ Focus on the information available
‐ Think outside the box (e.g. the sentiment study)
– Good source to study potential ways to correct for the
selectivity of Big Data sources
– In future studies we will be looking at:
‐ Sentiment, unsafety and more. Population
composition, population dynamics and other
background characteristics

The Future
The future of statistics looks BIG

Thank you for your attention! @pietdaas
