2. Useful.Beautiful.Data: social media
− To produce official statistics you need DATA
It’s getting more and more difficult to collect data from
respondents
• Response burden
• Decreasing response rates
• Mode effects (CAPI/PAPI/CATI/CAWI)
− What are alternatives?
Administrative data sources (since the 1980s)
BIG DATA (NOW), such as social media
4. Potential of social media
− 3 million public messages produced every day in the
Netherlands
mainly on Twitter and Facebook (~60%)
Nearly ‘real-time’ available
− Content: Topics discussed
50% ‘pointless babble’ (noisy) but there are messages relevant for official
statistics
Selecting the relevant part is important (removing noise)
− Producers: Not much info (directly) available
But much can be derived
5. Social media in the Netherlands
Map by Eric Fischer (via Fast Company)
6. Examples of social media studies at CBS/CBDS
− Content
1. Sentiment in social media
How does the average sentiment in social media develop
over time?
2. Feelings of social tension
Can social media be used to measure specific feelings in (the online)
society?
3. Propensity to move (‘Wish to move’)
Can we identify messages of people that wish to move to another
house?
− Population
4. Characterizing users
Derive characteristics / discern subpopulations
8. 1. Social media sentiment (2)
− Facebook and Twitter messages both contribute
− Daily data is highly volatile
− Monthly aggregates correlate well with consumer confidence (> 0.9)
− Including sentiment series improves the accuracy of consumer
confidence series (survey data)
− Product:
Averaged monthly or smoothed weekly online Dutch sentiment could be a
potential new indicator
Can also be produced for large Dutch cities
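The aggregation and comparison described on this slide can be sketched as follows. This is a minimal illustration, not the CBS method: the function names `monthly_means` and `pearson` are hypothetical, and all numbers are made up for the example.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def monthly_means(daily):
    """Average daily values per calendar month.

    daily: dict mapping ISO date strings ('YYYY-MM-DD') to a sentiment score.
    Returns a dict mapping 'YYYY-MM' to the mean score of that month.
    """
    buckets = defaultdict(list)
    for day, score in daily.items():
        buckets[day[:7]].append(score)  # 'YYYY-MM' prefix identifies the month
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, not CBS data.
daily_sentiment = {
    "2014-01-05": 0.10, "2014-01-20": 0.30,
    "2014-02-03": 0.25, "2014-02-18": 0.35,
    "2014-03-07": 0.50, "2014-03-21": 0.40,
}
consumer_confidence = {"2014-01": -10.0, "2014-02": -6.0, "2014-03": -1.0}

monthly = monthly_means(daily_sentiment)
# Compare only the months present in both series.
months = sorted(set(monthly) & set(consumer_confidence))
r = pearson([monthly[m] for m in months],
            [consumer_confidence[m] for m in months])
```

Averaging per month smooths out the daily volatility before the two series are compared; a weekly smoothed variant would follow the same pattern with a different bucket key.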
9. 2. Social tension indicator
Available at: http://research.cbs.nl/socialtension/en/
Percentage of messages indicating social tension
10. 2. Social tension indicator (2)
− Currently based on Twitter messages alone
Other platforms can be added
− Selected messages containing specific keywords
These were originally derived from the safety monitor questionnaire
The events detected were used as feedback
− Peaks indicate points in time at which increasing numbers of social
tension related messages are produced
Usually don’t last long
Sometimes a shift in the baseline is observed (e.g. MH17)
− Product: can be produced on a daily basis
This is what ‘real-time’ statistics will look like
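A daily indicator of this kind (share of messages containing tension keywords) could be computed roughly as below. This is a sketch under assumptions: the keyword list is hypothetical and in English, since the actual keywords are not given in these slides, and the tokenisation is deliberately naive.

```python
from collections import Counter

# Hypothetical keyword list for illustration only; the actual keywords
# (derived from the safety monitor questionnaire) are not given in the slides.
TENSION_KEYWORDS = {"unsafe", "afraid", "riot"}


def is_tension_message(text):
    """True if any keyword occurs as a whitespace-separated word.

    Simple whitespace tokenisation; punctuation is not stripped.
    """
    words = set(text.lower().split())
    return not words.isdisjoint(TENSION_KEYWORDS)


def daily_tension_percentage(messages):
    """messages: iterable of (iso_date, text) pairs.

    Returns {iso_date: percentage of that day's messages flagged}.
    """
    totals, flagged = Counter(), Counter()
    for day, text in messages:
        totals[day] += 1
        if is_tension_message(text):
            flagged[day] += 1
    return {day: 100.0 * flagged[day] / totals[day] for day in sorted(totals)}


# Made-up sample messages.
sample = [
    ("2014-07-16", "nice weather today"),
    ("2014-07-16", "i feel unsafe in this street"),
    ("2014-07-17", "riot downtown tonight"),
    ("2014-07-17", "so afraid right now"),
]
percentages = daily_tension_percentage(sample)
```

Because the indicator is a per-day ratio, it can be refreshed as soon as a day's messages are in, which is what makes a daily product feasible.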
11. 3. ‘Wish to move’
− Current topic of research
Social media contains messages that indicate a ‘wish’ of people
to move to another house (on all platforms)
Select messages containing ‘verhuiz*’ or ‘verhuis*’
Created a model to identify messages of people that wish to
move (accuracy 0.85 ±0.02)
Relate social media findings to findings derived from
survey/admin data
Study time series to check at what frequency such an indicator
could best be produced
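The wildcard preselection on ‘verhuiz*’ or ‘verhuis*’ can be expressed as a regular expression. The sketch below covers only this preselection step; the model that identifies a genuine wish to move is not shown, and the names `MOVE_PATTERN` and `mentions_moving` are hypothetical.

```python
import re

# Words starting with 'verhuiz' or 'verhuis', e.g. 'verhuizen', 'verhuisd'.
MOVE_PATTERN = re.compile(r"\bverhui[zs]\w*", re.IGNORECASE)


def mentions_moving(text):
    """True if the message contains a word matching verhuiz* or verhuis*."""
    return MOVE_PATTERN.search(text) is not None


# Made-up example messages.
examples = [
    "Ik wil verhuizen naar Utrecht",
    "We zijn vorig jaar verhuisd",
    "Mooi weer vandaag",
]
selected = [text for text in examples if mentions_moving(text)]
```

Note that the second example is selected although it describes a past move, which is why a model is still needed after this step.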
12. 4. Characterizing social media users
− Social media contains multiple populations
− Identifying Dutch users
• 3 approaches:
From meta/para data available (language setting, location)
From network structure (following), essential when hardly any
user info is present
From texts, discern Dutch and Flemish ‘tweets’
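The first approach (meta/para data) could look roughly like this. It is a sketch under assumptions: the profile fields `lang` and `location` and the place list are hypothetical, not taken from any specific platform API.

```python
# Hypothetical place list, for illustration only.
DUTCH_PLACES = {"nederland", "netherlands", "amsterdam", "rotterdam", "utrecht"}


def dutch_evidence(profile):
    """Return the list of metadata signals suggesting a Dutch user.

    profile: dict with optional string fields 'lang' and 'location'
    (hypothetical field names).
    """
    signals = []
    if profile.get("lang", "").lower() == "nl":
        signals.append("language setting")
    location = profile.get("location", "").lower()
    if any(place in location for place in DUTCH_PLACES):
        signals.append("location")
    return signals


# Made-up profiles.
both = dutch_evidence({"lang": "nl", "location": "Utrecht, NL"})
lang_only = dutch_evidence({"lang": "nl", "location": "Gent"})
neither = dutch_evidence({"lang": "en", "location": "Antwerpen"})
```

The second profile shows why metadata alone is not enough: a language setting of ‘nl’ fits Flemish users as well, so the network and text approaches are needed to separate the two.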
13. 4. Characterizing social media users (2)
− Social media contains multiple populations
− Discerning between accounts of people and companies
Two-step approach
Human (private users)
Non-human (corporate users)
Private persons (77%)
Self-employed (9%)
‘Non-profit groups’ (11%)
Companies (3%)
14. 4. Characterizing social media users (3)
− Social media contains multiple populations
− Identifying background characteristics
Challenging topic:
• Gender (M/F) could be identified with 96% accuracy
- Combining user short bio, first names, tweet content and pictures
- 50% male, 33% female, 17% ‘others’
• Other characteristics are possible (future research)
15. Conclusion
− Social media is an interesting data source for official statistics
− To enable this, two steps are essential:
• Noise reduction
- By aggregating lots of data
- By removing ‘off-topic’ messages
• Correcting for differences between ‘on-line’ and ‘real-world’ populations
- By removing non-target population users
- By applying a model (work in progress)