Useful.Beautiful.Data: social media
− To produce official statistics you need DATA
▪ It’s getting more and more difficult to collect data from
respondents
• Response burden
• Decreasing response rates
• Mode effects (CAPI/PAPI/CATI/CAWI)
− What are alternatives?
▪ Administrative data sources (since the 1980s)
▪ BIG DATA (NOW), such as social media
The glass is half full
Potential of social media
− 3 million public messages produced every day in the
Netherlands
▪ Mainly on Twitter and Facebook (~60%)
▪ Available in near ‘real time’
− Content: Topics discussed
 50% ‘pointless babble’ (noisy) but there are messages relevant for official
statistics
 Selecting the relevant part is important (removing noise)
− Producers: Not much info (directly) available
 But much can be derived
Social media in the Netherlands
Map by Eric Fischer (via Fast Company)
Map by Eric Fischer
Examples of social media studies at CBS/CBDS
− Content
1. Sentiment in social media
▪ How does the average sentiment in social media develop
over time?
2. Feelings of social tension
▪ Can social media be used to measure specific feelings in (the online)
society?
3. Propensity to move (‘Wish to move’)
▪ Can we identify messages from people who wish to move to another
house?
− Population
4. Characterizing users
▪ Derive characteristics / discern subpopulations
1. Social media sentiment
1. Social media sentiment (2)
− Facebook and Twitter messages both contribute
− Daily data is highly volatile
− Monthly aggregates correlate well with consumer confidence (correlation > 0.9)
− Including sentiment series improves the accuracy of consumer
confidence series (survey data)
− Product:
▪ Averaged monthly or smoothed weekly online Dutch sentiment could be a
potential new indicator
▪ Can also be produced for large Dutch cities
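The aggregation and correlation check described on this slide can be sketched as follows. This is a minimal illustration, not the CBS implementation: the daily sentiment values, the consumer confidence figures and the function names are all made up for the example.

```python
from collections import defaultdict
from math import sqrt


def monthly_mean(daily):
    """Average daily values per month; keys are 'YYYY-MM-DD' strings."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[day[:7]].append(value)  # 'YYYY-MM' identifies the month
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical inputs: volatile daily sentiment and a monthly survey series.
daily_sentiment = {
    "2014-01-05": 0.10, "2014-01-20": 0.30,
    "2014-02-03": 0.05, "2014-02-18": 0.15,
    "2014-03-07": 0.40, "2014-03-22": 0.20,
}
consumer_confidence = {"2014-01": -8.0, "2014-02": -12.0, "2014-03": -5.0}

monthly = monthly_mean(daily_sentiment)
months = sorted(monthly)
r = pearson([monthly[m] for m in months],
            [consumer_confidence[m] for m in months])
print(f"correlation of monthly aggregates: {r:.3f}")
```

Averaging per month smooths out the daily volatility before the two series are compared; a smoothed weekly series would follow the same pattern with a different bucket key.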
2. Social tension indicator
Available at: http://research.cbs.nl/socialtension/en/
Percentage of messages indicating social tension
2. Social tension indicator (2)
− Currently based on Twitter messages alone
▪ Other platforms can be added
− Selected messages containing specific keywords
▪ These were originally derived from the Safety Monitor questionnaire
▪ Detected events were used as feedback
− Peaks indicate points in time at which increasing numbers of
social-tension-related messages are produced
▪ They usually don’t last long
▪ Sometimes a shift in the baseline is observed (e.g. MH17)
− Product: can be produced on a daily basis
▪ This is what ‘real-time’ statistics will look like
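A daily indicator of this kind (the percentage of messages that contain a tension-related keyword) could be computed along the following lines. The keyword list and the messages are hypothetical placeholders; the actual keywords derived from the questionnaire are not given in these slides.

```python
from collections import Counter

# Hypothetical keywords, for illustration only.
KEYWORDS = {"onrust", "rellen", "bedreiging"}


def mentions_tension(text):
    """True if any word in the message matches a keyword."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return any(w in KEYWORDS for w in words)


def daily_percentage(messages):
    """messages: iterable of (date, text) pairs; returns {date: percent}."""
    total, hits = Counter(), Counter()
    for day, text in messages:
        total[day] += 1
        if mentions_tension(text):
            hits[day] += 1
    return {day: 100.0 * hits[day] / total[day] for day in sorted(total)}


messages = [
    ("2014-07-16", "Lekker weer vandaag"),
    ("2014-07-16", "Rellen in de stad gisteren"),
    ("2014-07-16", "Op weg naar huis"),
    ("2014-07-17", "Veel onrust na het nieuws"),
    ("2014-07-17", "Bedreiging gemeld bij de politie"),
    ("2014-07-17", "Koffie!"),
    ("2014-07-17", "Mooie wedstrijd"),
]
pct = daily_percentage(messages)
for day, value in pct.items():
    print(day, f"{value:.1f}%")
```

Expressing the indicator as a percentage rather than a raw count keeps it comparable across days with different message volumes.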
3. ‘Wish to move’
− Current topic of research
▪ Social media contains messages that indicate a ‘wish’ of people
to move to another house (on all platforms)
▪ Select messages containing ‘verhuiz*’ or ‘verhuis*’ (Dutch word stems
for ‘to move house’)
▪ Created a model to identify messages from people who wish to
move (accuracy 0.85 ± 0.02)
▪ Relate social media findings to findings derived from
survey/admin data
▪ Study time series to check at what frequency such an indicator
could best be produced
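The wildcard selection on ‘verhuiz*’ or ‘verhuis*’ can be expressed as a regular expression. The sketch below covers only this first selection step; the example messages are invented, and the model that decides whether a selected message really expresses a wish to move is not shown.

```python
import re

# 'verhuiz*' or 'verhuis*': any word that starts with "verhuiz" or "verhuis".
MOVE_PATTERN = re.compile(r"\bverhui[zs]\w*", re.IGNORECASE)


def select_candidates(messages):
    """Keep only messages containing a word matching the pattern."""
    return [m for m in messages if MOVE_PATTERN.search(m)]


messages = [
    "Ik wil zo graag verhuizen naar Utrecht",
    "Volgende maand verhuis ik eindelijk",
    "Verhuisdozen te koop",
    "Mooi weer vandaag",
]
candidates = select_candidates(messages)
print(candidates)
```

Note that the third message (moving boxes for sale) matches the pattern without expressing a wish to move, which is why a second, model-based step is needed after the keyword selection.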
4. Characterizing social media users
− Social media contains multiple populations
− Identifying Dutch users
• 3 approaches:
▪ From available meta/paradata (language setting, location)
▪ From network structure (following), essential when hardly any
user info is present
▪ From texts: distinguish Dutch from Flemish ‘tweets’
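The first approach, using metadata only, might look like the sketch below. The profile fields `lang` and `location`, the place list and the three-way outcome are assumptions made for this example, not the rules used in the study. Profiles that cannot be decided from metadata are left as "unknown", which is where the network and text approaches would come in.

```python
# Hypothetical list of location strings that point to the Netherlands.
DUTCH_PLACES = ("nederland", "netherlands", "amsterdam", "rotterdam", "utrecht")


def classify_user(profile):
    """Return 'dutch', 'unknown' or 'other' based on profile metadata."""
    lang = (profile.get("lang") or "").lower()
    location = (profile.get("location") or "").lower()
    if any(place in location for place in DUTCH_PLACES):
        return "dutch"
    if lang == "nl" and not location:
        # Dutch-language setting but no location: could also be Flemish.
        return "unknown"
    return "other"


profiles = [
    {"lang": "nl", "location": "Amsterdam"},
    {"lang": "nl", "location": ""},
    {"lang": "en", "location": "London"},
]
labels = [classify_user(p) for p in profiles]
print(labels)
```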
4. Characterizing social media users (2)
− Social media contains multiple populations
− Discerning between accounts of people and companies
▪ Two-step approach
Human (private users)
Non-human (corporate users)
Private persons (77%)
Self-employed (9%)
‘Non-profit groups’ (11%)
Companies (3%)
4. Characterizing social media users (3)
− Social media contains multiple populations
− Identifying background characteristics
▪ Challenging topic:
• Gender (M/F) could be identified with 96% accuracy
- By combining the user’s short bio, first name, tweet content and pictures
- 50% male, 33% female, 17% ‘others’
• Other characteristics are possible (future research)
Conclusion
− Social media is an interesting data source for official statistics
− To enable this, two steps are essential:
– Noise reduction
• By aggregating lots of data
• By removing ‘off-topic’ messages
– Correcting differences between ‘online’ and ‘real-world’ populations
• By removing non-target population users
• By applying a model (work in progress)
Useful by Piet Daas
