
New Data Sources for Statistics, Social media: Twitter.

Presentation at the American Association for Public Opinion Research (AAPOR) Conference 2012, Orlando, FL, USA.

  1. New Data Sources for Statistics: Experiences at Statistics Netherlands
     Social media: Twitter
     Piet Daas, Marko Roos, Mark van de Ven and Joyce Neroni
     Statistics Netherlands
     AAPOR 2012
  2. Why are we interested in data sources such as Twitter?
     • All National Statistical Institutes use:
       • Survey data
       • Sometimes also administrative data
     • But there are other sources of information out there (in increasing numbers: BIG Data)
       • Can they be used for statistics?
       • Burden and cost reduction
       • Try it!
       • Innovative research is greatly stimulated
     AAPOR 2012: Twitter as a potential data source for statistics
  3. Why study Twitter?
     [Figure: maps by Eric Fischer (via Fast Company)]
  4. About Twitter
     • Twitter is used intensively in the Netherlands
       • Relatively easily accessible (text) data
     • Potential source of personal information, opinions, and sentiments
     • But what kind of information is actually discussed?
       1) Identify the topics discussed in the Netherlands
          • In public tweets only
       2) Is this information useful?
  5. Start with collecting data
     • How?
       • Tried several ways
       • Best option was to:
         1) Collect usernames
         2) Identify ‘Dutch’ users
         3) Collect tweets from Dutch users
         4) Identify topics in those tweets
  6. 1) Collect usernames
     • Breadth-first algorithm / snowball sampling
       • Started with a user with many followers
         • A famous Dutch politician with 79,798 followers
       • Collected her followers, then the followers of her followers, etc.
       • Using the Twitter REST API, 12 user accounts and PHP scripts
     • After 4 weeks we obtained
       • 4,413,391 unique users (IDs)
       • Collected user ID, username, location and profile information
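The breadth-first / snowball step can be illustrated with a minimal Python sketch. It is not the authors' PHP code: `snowball_users`, the `get_followers` callable and the toy follower graph are hypothetical stand-ins for paged calls to the Twitter REST API.

```python
from collections import deque
from typing import Callable, Iterable, List


def snowball_users(seed: str,
                   get_followers: Callable[[str], Iterable[str]],
                   max_users: int) -> List[str]:
    """Breadth-first traversal of the follower graph starting at `seed`.

    `get_followers` is a hypothetical callable standing in for an API request
    that returns the followers of one user.
    """
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < max_users:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower in seen:
                continue  # each user id is collected only once
            seen.add(follower)
            order.append(follower)
            queue.append(follower)
            if len(order) >= max_users:
                break
    return order


# Toy follower graph (invented data, for illustration only).
toy_followers = {
    "politician": ["anna", "bram"],
    "anna": ["bram", "carla"],
    "bram": ["politician", "dirk"],
    "carla": [],
    "dirk": ["anna"],
}
users = snowball_users("politician", lambda u: toy_followers.get(u, []), max_users=10)
```

The `seen` set is what keeps the result to unique users; the queue makes the crawl visit direct followers of the seed before followers of followers.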
  7. 2) Identify ‘Dutch’ users
     • By using location information provided
       • A considerable number of users do this
       • Checked the location names provided
         • Inclusion and exclusion list
     • A total of 380,415 (~9%) users were identified as located in the Netherlands
     • 38% of the users, 1,661,467, provided no location info
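A hedged sketch of how an inclusion and exclusion list could be applied to the free-text location field. The terms in `INCLUDE` and `EXCLUDE` and the function `classify_location` are invented for illustration; the source does not publish the actual lists or matching rules.

```python
# Hypothetical example terms; the real inclusion/exclusion lists are not in the source.
INCLUDE = {"nederland", "netherlands", "amsterdam", "rotterdam", "utrecht"}
EXCLUDE = {"new amsterdam", "netherlands antilles"}


def classify_location(location):
    """Return 'dutch', 'other' or 'no_location' for a profile location string."""
    text = (location or "").strip().lower()
    if not text:
        return "no_location"
    # Exclusions are checked first so that they override a matching inclusion term.
    if any(term in text for term in EXCLUDE):
        return "other"
    if any(term in text for term in INCLUDE):
        return "dutch"
    return "other"


# Shares reported on the slide, recomputed from the user counts in the source.
TOTAL_USERS = 4_413_391
dutch_share = round(100 * 380_415 / TOTAL_USERS, 1)
no_location_share = round(100 * 1_661_467 / TOTAL_USERS, 1)
```

The recomputed shares come out at 8.6% and 37.6%, consistent with the rounded ~9% and 38% quoted on the slide.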
  8. 3) Collect tweets
     • For the 380,415 users the 200 most recent tweets were collected
       • A total of 12,093,065 messages was obtained
       • 39% of the users had no ‘tweets’
     • Some characteristics
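As a small illustration of the per-user cap of 200 tweets, the sketch below keeps only the most recent messages from a list of `(timestamp, text)` pairs. The function name `most_recent` and the tuple layout are assumptions made for this example; in practice the API response itself may already be limited and ordered.

```python
from heapq import nlargest


def most_recent(tweets, limit=200):
    """Return up to `limit` tweets with the largest timestamps, newest first.

    `tweets` is assumed to be a list of (timestamp, text) tuples.
    """
    return nlargest(limit, tweets, key=lambda tweet: tweet[0])


# Invented data: 250 tweets with increasing integer timestamps.
sample = [(i, f"tweet {i}") for i in range(250)]
recent = most_recent(sample)
```

A user without tweets simply yields an empty list, which matches the slide's observation that a sizeable share of accounts contributed no messages.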
  9. 4) Identify topics
     • Used 2 approaches
       1) Hashtags (1,750,074 with 1 hash, 14.5%)
          • Hash sign (#) identifies ‘keyword’
            • E.g. #ned, #fail, #wk2010
          • Manual and text-mining approach
       2) Non-hashtags (10,330,613 in total, 85.4%)
          • Manual (sample)
          • Text-mining approach failed here
            • Result of the large ‘Other’ group
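The split into hashtag and non-hashtag messages can be sketched as follows. The regular expression and the helper names `extract_hashtags` and `split_by_hashtag` are assumptions for illustration; the source does not describe the exact matching rule or the text-mining tools used.

```python
import re

# Assumed pattern: a hash sign followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lower-cased hashtags found in one tweet, without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def split_by_hashtag(tweets):
    """Split tweets into (with at least one hashtag, without any hashtag)."""
    tagged, untagged = [], []
    for tweet in tweets:
        (tagged if extract_hashtags(tweet) else untagged).append(tweet)
    return tagged, untagged


# Invented example tweets using the hashtags mentioned on the slide.
examples = ["Wat een wedstrijd! #ned #WK2010", "Trein weer te laat", "#fail"]
tagged, untagged = split_by_hashtag(examples)
```

Hashtag tweets can then be grouped by keyword, while the untagged remainder needs manual coding or text mining, which is where the slide reports the text-mining approach failed.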
  10. Topic identification: Hashtags
      [Bar chart: contribution (%) per theme, axis from 0 to 50; legend: Hashtags, Non-hashtags, Total. Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other.]
      Labelled values: Politics 20%, Spare time 9%, Sports 13%, Other 18%
  11. Topic identification: Non-hashtags*
      [Bar chart: contribution (%) per theme, axis from 0 to 50; legend: Hashtags, Non-hashtags, Total. Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other.]
      Labelled values: Spare time 10%, Sports 6%, Other 51%
      * A random sample
  12. Topic identification: Combined
      [Bar chart: contribution (%) per theme, axis from 0 to 50; legend: Hashtags, Non-hashtags, Total. Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other.]
      Labelled values: Events 1%, Media 7%, Politics 3%, Spare time 10%, Sports 7%, Work 5%, Other 46%
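The source does not say how the combined figures were derived. One plausible reading, offered here only as an assumption, is a weighted average of the hashtag and non-hashtag shares, weighted by the number of tweets in each group. The sketch checks that reading against the three themes labelled on slides 10 and 11; the names `SHARES` and `combined_share` are hypothetical.

```python
# Tweet counts per group, as given on slide 9 of the source.
N_HASHTAG = 1_750_074
N_NON_HASHTAG = 10_330_613

# Labelled theme shares (%) from slides 10 and 11: (hashtag, non-hashtag).
SHARES = {"Spare time": (9, 10), "Sports": (13, 6), "Other": (18, 51)}


def combined_share(hashtag_pct, non_hashtag_pct):
    """Tweet-count-weighted average of the two group shares (an assumption)."""
    total = N_HASHTAG + N_NON_HASHTAG
    return (hashtag_pct * N_HASHTAG + non_hashtag_pct * N_NON_HASHTAG) / total


combined = {theme: round(combined_share(h, n)) for theme, (h, n) in SHARES.items()}
```

Under this assumption the three themes come out at 10%, 7% and 46%, which agrees with the combined values labelled on the slide; that agreement supports the reading but does not confirm it as the authors' method.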
  13. Conclusions
      • Is Twitter of potential interest for statistics?
        • Yes
      • What are the interesting topics for us?
        • Work (5%), politics (3%), spare time (10%) and events (1%)
      • Can the data be used ‘as is’?
        • No
          - ‘Low information content’
          - Representativity of users
  14. Conclusions (2)
      • Representativity of the data is a serious issue
        • Clear that only a subset of the (Dutch) population is observed
          • Not everybody in the Netherlands is active on Twitter
        • Hardly any background information available
          • Although some users provide very interesting details in their user profile
      • Workaround?
        • (Only) use Twitter to get quick info (a trend) on a specific topic
  15. Future work
      • Continue to study social media!
      • But:
        1) No longer collect data ourselves ( )
        2) In future studies focus on:
           • Mining sentiment towards specific topics
             • E.g. economy, consumer sentiment, but also statistics and Statistics Netherlands surveys
           • Background info of users
  16. Thank you for your attention!
      • #Questions? Contact or follow me at: @pietdaas