Useful by Piet Daas

Useful.Beautiful.Data: social media
− To produce official statistics you need DATA
  It's getting more and more difficult to collect data from
respondents
• Response burden
• Decreasing response rates
• Mode effects (CAPI/PAPI/CATI/CAWI)
− What are alternatives?
  Administrative data sources (since the 1980s)
 BIG DATA (NOW), such as social media

Potential of social media
− 3 million public messages produced every day in the
Netherlands
 mainly on Twitter and Facebook (~60%)
  Available in nearly ‘real time’
− Content: Topics discussed
 50% ‘pointless babble’ (noisy) but there are messages relevant for official
statistics
 Selecting the relevant part is important (removing noise)
− Producers: Not much info (directly) available
 But much can be derived

Social media in the Netherlands
Map by Eric Fischer (via Fast Company)

Examples of social media studies at CBS/CBDS
− Content
1. Sentiment in social media
  How does the average sentiment in social media develop
over time?
2. Feelings of social tension
 Can social media be used to measure specific feelings in (the online)
society?
3. Propensity to move (‘Wish to move’)
 Can we identify messages of people that wish to move to another
house?
− Population
4. Characterizing users
  Derive characteristics / discern subpopulations

1. Social media sentiment (2)
− Facebook and Twitter messages both contribute
− Daily data is highly volatile
− Monthly aggregates correlate well with consumer confidence (> 0.9)
− Including sentiment series improves the accuracy of consumer
confidence series (survey data)
− Product:
 Averaged monthly or smoothed weekly online Dutch sentiment could be a
potential new indicator
 Can also be produced for large Dutch cities
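
The aggregation and correlation step described on this slide could be sketched roughly as follows. This is a minimal illustration, not the CBS method: the function names, the data layout (ISO date strings mapped to scores) and all numbers are assumptions made up for the example.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def monthly_means(daily):
    """Average daily sentiment scores per month.

    `daily` maps ISO dates ("YYYY-MM-DD") to a sentiment score.
    Returns a dict keyed by "YYYY-MM".
    """
    buckets = defaultdict(list)
    for day, score in daily.items():
        buckets[day[:7]].append(score)
    return {month: mean(scores) for month, scores in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, NOT CBS data.
daily_sentiment = {
    "2014-01-05": 0.10, "2014-01-20": 0.30,
    "2014-02-03": 0.25, "2014-02-18": 0.35,
    "2014-03-09": 0.50, "2014-03-27": 0.40,
}
consumer_confidence = {"2014-01": -10.0, "2014-02": -7.0, "2014-03": -2.0}

monthly = monthly_means(daily_sentiment)
months = sorted(monthly)
r = pearson([monthly[m] for m in months],
            [consumer_confidence[m] for m in months])
print(f"correlation over {len(months)} months: {r:.2f}")
```

Averaging per month smooths out the daily volatility mentioned above before the two series are compared.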

2. Social tension indicator
Available at: http://research.cbs.nl/socialtension/en/
[Figure: Percentage of messages indicating social tension]

2. Social tension indicator (2)
− Currently based on Twitter messages alone
 Other platforms can be added
− Selected messages containing specific keywords
 These were originally derived from the safety monitor questionnaire
 Used the events detected as feedback
− Peaks indicate points in time at which increasing numbers of social
tension related messages are produced
 Usually don’t last long
  Sometimes a shift in the baseline is observed (e.g. MH17)
− Product: can be produced on a daily basis
  This is what ‘real-time’ statistics will look like
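
A daily indicator of this kind could be computed along the following lines. This is a sketch under stated assumptions: the keyword set is hypothetical (the actual CBS keyword list is not given in these slides), and the messages and dates are invented.

```python
from collections import Counter

# Hypothetical keywords; the actual CBS keyword list is not given in the slides.
TENSION_KEYWORDS = {"onveilig", "rellen", "bedreiging"}


def mentions_tension(text, keywords=TENSION_KEYWORDS):
    """True if any keyword occurs as a whole word in the message."""
    words = {w.strip(".,!?:;'\"").lower() for w in text.split()}
    return not keywords.isdisjoint(words)


def daily_tension_percentage(messages):
    """Percentage of messages per day that contain a tension keyword.

    `messages` is an iterable of (iso_date, text) pairs.
    """
    totals, hits = Counter(), Counter()
    for day, text in messages:
        totals[day] += 1
        if mentions_tension(text):
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in totals}


# Invented example messages.
messages = [
    ("2015-01-01", "Lekker weer vandaag"),
    ("2015-01-01", "Ik voel me onveilig in deze buurt"),
    ("2015-01-01", "Koffie tijd"),
    ("2015-01-01", "Gelukkig nieuwjaar!"),
    ("2015-01-02", "Rellen in de stad!"),
    ("2015-01-02", "Boodschappen gedaan"),
]

indicator = daily_tension_percentage(messages)
for day in sorted(indicator):
    print(day, indicator[day])
```

Expressing the result as a share of all messages per day, rather than a raw count, keeps the indicator comparable when the total message volume changes.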

3. ‘Wish to move’
− Current topic of research
 Social media contains messages that indicate a ‘wish’ of people
to move to another house (on all platforms)
 Select messages containing ‘verhuiz*’ or ‘verhuis*’
  Created a model to identify messages of people who wish to
move (accuracy 0.85 ± 0.02)
 Relate social media findings to findings derived from
survey/admin data
  Study time series to check at what frequency such an indicator
could best be produced
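
The wildcard selection ‘verhuiz*’ / ‘verhuis*’ could be expressed as a regular expression, for example as below. The function name and example messages are made up for illustration. This step only yields candidate messages; the classification model mentioned above is not described in the slides and is not reproduced here.

```python
import re

# Words starting with "verhuiz" or "verhuis" (e.g. verhuizen, verhuisd),
# matched case-insensitively at a word boundary.
MOVE_PATTERN = re.compile(r"\bverhui[zs]\w*", re.IGNORECASE)


def is_candidate(text):
    """True if the message contains a word matching verhuiz* or verhuis*."""
    return MOVE_PATTERN.search(text) is not None


# Invented example messages.
examples = [
    "Wij gaan volgende maand verhuizen",
    "Ik ben net verhuisd",
    "Verhuisdozen gezocht",
    "Mooie dag vandaag",
]
candidates = [text for text in examples if is_candidate(text)]
print(candidates)
```

Note that a message such as "Ik ben net verhuisd" (already moved) is selected as well, which suggests why a model is still needed to separate an actual wish to move from other mentions.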

4. Characterizing social media users
− Social media contains multiple populations
− Identifying Dutch users
• 3 approaches:
  From available metadata/paradata (language setting, location)
 From network structure (following), essential when hardly any
user info is present
 From texts, discern Dutch and Flemish ‘tweets’
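
The first approach (metadata) might look something like the sketch below. Everything here is an assumption for illustration: the profile field names `lang` and `location`, the list of place names, and the idea of counting signals are not taken from the slides.

```python
# Hypothetical place names used as a location hint; a real list would be longer.
DUTCH_PLACES = ("amsterdam", "rotterdam", "utrecht", "nederland", "netherlands")


def dutch_signals(profile):
    """Count metadata hints (0-2) that an account may belong to a Dutch user.

    `profile` is a dict with optional, hypothetical keys "lang" and "location".
    """
    lang = (profile.get("lang") or "").lower()
    location = (profile.get("location") or "").lower()
    signals = 0
    if lang == "nl":
        signals += 1
    if any(place in location for place in DUTCH_PLACES):
        signals += 1
    return signals


print(dutch_signals({"lang": "nl", "location": "Utrecht, NL"}))
print(dutch_signals({"lang": "nl", "location": None}))
print(dutch_signals({"lang": "en", "location": "London"}))
```

A language setting of "nl" alone cannot separate Dutch from Flemish users, which is presumably why the network-based and text-based approaches are listed as well.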

4. Characterizing social media users (2)
− Discerning between accounts of people and companies
 2 step approach
• Categories: human (private users); non-human (corporate users)
• Shares: private persons (77%); self-employed (9%); ‘non-profit groups’ (11%); companies (3%)

4. Characterizing social media users (3)
− Identifying background characteristics
 Challenging topic:
• Gender (M/F) could be identified with 96% accuracy
- Combining user short bio, first names, tweet content and pictures
- 50% male, 33% female, 17% ‘others’
• Other characteristics are possible (future research)

Conclusion
− Social media is an interesting data source for official statistics
− To enable this, two steps are essential:
  Noise reduction
• By aggregating lots of data
• By removing ‘off-topic’ messages
  Correcting for differences between ‘online’ and ‘real-world’ populations
• By removing non-target population users
• By applying a model (work in progress)
