Daas twitter as-a_data_source_for_official_statistics-131

A GOR presentation

Transcript

  • 1. General Online Research Conference, GOR 11, March 14-16, 2011, Heinrich Heine University, Düsseldorf, Germany
      First author: Piet Daas, Statistics Netherlands
      Second author: Mark van de Ven, Erasmus University Rotterdam
      Third author: Marko Roos, Statistics Netherlands
      Twitter as a data source for official statistics: first results
      Contact: pjh.daas@cbs.nl
  • 2. Twitter as a data source for official statistics: first results
      Piet Daas, Mark van de Ven, and Marko Roos
      Statistics Netherlands, Erasmus University Rotterdam
      GOR 2011
  • 3. Overview
      • Why data sources such as Twitter?
      • Focus of our research
      • Data collection
          • 2 approaches
          • Messages obtained
      • Topic identification
      • Conclusion
      #GOR11: Twitter as a data source for official statistics: first results
  • 4. Why are we interested in data sources such as Twitter?
      • All National Statistical Institutes use:
          • Survey data
          • Sometimes also administrative data (registers)
      • But there are other sources of information out there (many of them electronic)
      • Can they also be used?
      • Investigate it! (studies are supported by the DG of Statistics Netherlands)
  • 5. Examples of data sources studied
      ‘New’ data sources are studied at our office:
      1. Product prices on the internet
      2. Mobile phone data
      3. Global Positioning System (GPS) data (and traffic loop info)
      4. Social media: Twitter
      Can they be used for statistics?
  • 6. Social media: Twitter
      • Social media is used more and more intensively in the Netherlands and the rest of the world
      • Potential source of personal information, opinions, and sentiments
      • But what type of information is actually exchanged?
      • Investigated Twitter (as an example)
          • Easily accessible (text) data, and it is used a lot in the Netherlands
          • Identify the topics discussed in the Netherlands
  • 7. Social media: Twitter (2)
      • Twitter is a microblogging service
      • Text messages of 140 characters max
          • Called ‘tweets’
          • Posted to the public or to friends only
      • The hash sign (#) is used to highlight ‘keywords’
          • Example: #Eurostat, #GOR11
      • A few examples
  • 8. Focus of our research
      • Identify topics discussed on Twitter in the Netherlands
      • On the basis of that information, decide if Twitter (and perhaps social media in general) is of interest for Statistics Netherlands
      • Collect tweets from ‘all’ Dutch Twitter users!
          • Try to get as complete an overview as possible
  • 9. Data collection: approach
      • Collect tweets of Dutch Twitter users
          1. Located in the Netherlands
          2. In Dutch language (optional)
      • First option: make use of the advanced search option of Twitter
          • ‘Survey’ a specific area
          • Language filtering
  • 11. Circle around Utrecht (200 km radius)
      Problem areas:
      • Belgium: Flanders!!
      • Germany
  • 12. Data collection: first approach (2)
      • Wrote a program that
          • Queried the ‘200 km Utrecht circle’ every 5 minutes
          • Collected all new Dutch tweets
      • Results seemed OK
          • Tweets located outside the Netherlands comprised only 4% of the total
          • These could easily be removed by
              • Major city name filtering (Brussels, Antwerp, Cologne, etc.)
              • Checking geocodes (remove all outside the Dutch borders)
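The two clean-up steps named on slide 12 (city name filtering and geocode checking) could look roughly like the sketch below. This is only an illustration, not the authors' program: the `Tweet` fields, the city list, and the rectangular bounding box are all assumptions. A rectangle is a crude stand-in for the Dutch border and still includes parts of Flanders and western Germany, so a real check would need a proper border polygon.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed, approximate bounding box around the European Netherlands.
# It is NOT the real border: it also covers parts of Belgium and Germany.
NL_LAT_RANGE = (50.75, 53.56)
NL_LON_RANGE = (3.35, 7.23)

# Assumed list of foreign city names to filter on (lowercase substrings).
FOREIGN_CITIES = ("brussel", "antwerp", "cologne", "köln")


@dataclass
class Tweet:
    """Hypothetical minimal tweet record used only for this sketch."""
    text: str
    location: str = ""            # free-text location from the user profile
    lat: Optional[float] = None   # geocode, if the tweet carries one
    lon: Optional[float] = None


def is_probably_dutch(tweet: Tweet) -> bool:
    """Apply city name filtering first, then the geocode check."""
    loc = tweet.location.lower()
    if any(city in loc for city in FOREIGN_CITIES):
        return False
    if tweet.lat is not None and tweet.lon is not None:
        in_lat = NL_LAT_RANGE[0] <= tweet.lat <= NL_LAT_RANGE[1]
        in_lon = NL_LON_RANGE[0] <= tweet.lon <= NL_LON_RANGE[1]
        return in_lat and in_lon
    # No contradicting evidence: keep the tweet.
    return True
```

The city name test runs first because, with a box this coarse, a tweet geocoded in Cologne would otherwise pass the coordinate check.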
  • 13. Data collection: first approach (3)
      • Typical profile of data collected
  • 14. Data collection: first approach (4)
      • However, there were issues
      • Maximum number of tweets collected is 1,500
          • Sometimes few and sometimes > 1,500 (in 5 minutes)
      • Twitter language filtering is not perfect
          • Some tweets were wrongly identified, some had no language
          • Slang is often used (a combination of English, Dutch, and street language)
      • Twitter applies a ‘quality’ filter
          • To reduce spam; could affect our collection process
          • Our test messages were hardly ever included!
  • 15. Data collection: second approach
      • User-oriented approach
      • First: collect as many Dutch Twitter usernames (IDs) as possible
          • Expected between 150,000 and 400,000
      • Second: collect tweets from all those users
          • Max 3600 per user, 200 per request
      • Third: identify topics discussed in the tweets
  • 16. Data collection: second approach (2)
      • Step 1: Collect Dutch usernames
      • Crawled through users
          • Started with a user with a large number of followers, then collected the names of followers, followers of followers, etc.
          • Account of a famous Dutch politician as root
          • A user is Dutch if the location in the user profile includes ‘Netherlands’ and/or the name of a Dutch municipality or province
      • In this way 380,415 unique usernames were collected
          • Which were all certainly located in the Netherlands
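The follower crawl on slide 16 is essentially a breadth-first traversal from a root account, combined with a location test on each profile. A minimal sketch under stated assumptions: `get_followers` and `get_location` are hypothetical callables standing in for whatever API access was actually used, the place list is a tiny illustrative sample (including the Dutch spelling ‘nederland’, which the slides do not mention), and the sketch expands the followers of every visited user, which the slides do not specify.

```python
from collections import deque

# Assumed sample of place names; a real list would hold all Dutch
# municipalities and provinces.
DUTCH_PLACES = ("netherlands", "nederland", "amsterdam", "utrecht", "rotterdam")


def looks_dutch(profile_location: str) -> bool:
    """True if the free-text profile location mentions a known Dutch place."""
    loc = profile_location.lower()
    return any(place in loc for place in DUTCH_PLACES)


def crawl_dutch_users(root, get_followers, get_location, max_users=1000):
    """Breadth-first crawl over followers, keeping users that look Dutch.

    get_followers(user) -> iterable of usernames   (hypothetical)
    get_location(user)  -> profile location string (hypothetical)
    """
    seen = {root}
    queue = deque([root])
    dutch = set()
    while queue and len(dutch) < max_users:
        user = queue.popleft()
        if looks_dutch(get_location(user)):
            dutch.add(user)
        for follower in get_followers(user):
            if follower not in seen:   # visit each account only once
                seen.add(follower)
                queue.append(follower)
    return dutch
```

With a toy follower graph held in dictionaries, the crawl can be tried without any network access:

```python
followers = {"root": ["a", "b"], "a": ["c"], "b": ["a"], "c": []}
locations = {"root": "Den Haag, Netherlands", "a": "Utrecht",
             "b": "Berlin", "c": "Amsterdam NL"}
found = crawl_dutch_users("root", followers.__getitem__, locations.__getitem__)
```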
  • 17. Data collection: second approach (3)
      • Step 2: Collect tweets
      • For each user, collected up to 200 of his/her most recent tweets
          • Theoretical max is 3600 (too much and too demanding)
      • A total of 12 million tweets were collected
          • Covered 2009 and the first 9 months of 2010
          • Depends on the activity of the individual user
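The limits quoted on the slides (200 tweets per request, 3600 per user) imply that a full timeline would take 18 requests per user, which helps explain why only the most recent 200 were fetched. The sketch below shows the arithmetic and a generic paging loop; `fetch_page` is a hypothetical callable, not a real Twitter API function, and the page-number interface is an assumption.

```python
import math


def requests_needed(max_tweets: int, per_request: int = 200) -> int:
    """Number of requests required to fetch max_tweets tweets."""
    return math.ceil(max_tweets / per_request)


def collect_tweets(user, fetch_page, max_tweets=200, per_request=200):
    """Fetch up to max_tweets recent tweets for one user, page by page.

    fetch_page(user, page, count) -> list of tweets (hypothetical);
    an empty list signals that the timeline is exhausted.
    """
    tweets = []
    page = 1
    while len(tweets) < max_tweets:
        batch = fetch_page(user, page, per_request)
        if not batch:
            break
        tweets.extend(batch)
        page += 1
    return tweets[:max_tweets]
```

With the defaults, a single request per user suffices, which matches the "up to 200 most recent tweets" choice on slide 17.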
  • 18. Data collection: second approach (4)
      • Step 3: Identify topics discussed
      • Many tweets contained #hashtags
          • ~1.8 million (14.5%)
      • Used these tweets to manually identify topics
          • 16,439 different hashtags
          • But the distribution is highly skewed
          • The first 500 comprised nearly 50% of all hashtag-containing tweets
          • Started with the first 500, then added the rest
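Counting hashtags and checking how much of the total the most frequent ones cover, as described on slide 18, can be sketched as follows. The regular expression and the lowercasing are assumptions about how hashtags were normalised. Note also that this sketch measures the share of hashtag occurrences, which only approximates the slide's share of hashtag-containing tweets, because a tweet with two different hashtags is counted twice.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count, per hashtag, the number of tweets that contain it."""
    counts = Counter()
    for text in tweets:
        # A set so that a hashtag repeated within one tweet counts once.
        counts.update({tag.lower() for tag in HASHTAG.findall(text)})
    return counts


def top_n_share(counts, n):
    """Fraction of all hashtag occurrences covered by the n most common tags."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total
```

Applied to the full collection, `top_n_share(counts, 500)` would be the figure to compare with the "nearly 50%" reported for the first 500 hashtags.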
  • 19. Hashtag topic identification
      • Results of #hashtag classification
      [Two pie charts: ‘All #hashtags’ and ‘Top 500 #hashtags’, showing the share per category; the percentages are listed in the table on slide 20]
  • 20. Hashtag topic identification (2)
      Classification of Twitter messages collected according to hashtags used
      Category | Description | Examples | Top 500# only (%) | All# (%) | Top 500# no Other (%) | All# no Other (%)
      Twitter | Twitter/internet specific language & slang | #durftevragen, #fail, #twexit | 12 | 5 | 19 | 19
      Sports | Sports, clubs, and sports events | #WK2010, #ajax, #oranje | 9 | 4 | 14 | 14
      Applications | Twitter specific programs | #nowplaying, #lastfm, #in | 8 | 3 | 13 | 12
      Politics | Political debates, leaders, and parties | #tk2010, #NOSdebat, #formatie | 7 | 3 | 11 | 11
      TV | Dutch TV programs (no political & no news) | #dwdd, #ohohcherso, #tvoh | 6 | 3 | 10 | 11
      Emotions | Sentiment and feelings | #moe, #LOL, #zucht, #heerlijk | 6 | 3 | 10 | 10
      Locations | References to a location or municipality | #amsterdam, #utrecht | 3 | 2 | 5 | 5
      Products | Referring to products | #iPhone, #iPad, #android | 3 | 1 | 4 | 4
      Events | Non-sport and non-political happenings | #twibbon, #LL10, #lowlands | 3 | 1 | 4 | 4
      News | Referring to news programs | #nos, #pownews, #Nujij | 2 | 1 | 4 | 4
      Companies | Referring to companies | #ns, #google, #tmobile, #KPN | 2 | 1 | 4 | 4
      Radio | Dutch radio programs | #3fm, #53j8, #radio1 | 1 | 1 | 2 | 2
      Other | Rest group, mostly unrelated tags | #koffie, #goedemorgen | 38 | 72 | - | -
  • 21. Twitter (hashtag) conclusion
      • First results (based on hashtag tweets)
          • Potentially interesting for politics and events (‘leisure time’)
          • Overall study suggests ~4% of our total dataset
              • Around 480,000 tweets
      • Twitter could probably also be used for information on social and cultural participation and on social cohesion
      • Need to further refine our studies
          • More in-depth studies of all tweets collected (also those without #)
          • Use (more advanced) text mining techniques for classification
  • 22. But
      • Representativity of the data is an issue
          • Clear that only a subset of the (Dutch) population is observed
          • Not everybody in the Netherlands is active on Twitter
          • Hardly any background information available
          • However, some users provide very interesting details in their user profile
      • Representativity of new data sources is a key issue in future research
  • 23. Thank you for your attention!
      • #Questions?