Twitter as a data mining source


Published in: Technology

  1. 1. Czech Twitter as a data mining source Josef Šlerka, WebExpo 2009
  2. 2. Twitter.com Twitter is a free social networking and micro-blogging service that enables its users to send and read messages known as tweets. Tweets are text-based posts of up to 140 characters displayed on the author's profile page and delivered to the author's subscribers, who are known as followers (Wikipedia)
  3. 3. What is data mining and how is it connected with Twitter?
  4. 4. Data mining is the process of extracting patterns from data. As more data are gathered, data mining is becoming an increasingly important tool to transform these data into information (Wikipedia). Variations include text mining and web mining, including semantic analysis
  5. 5. Twitter data mining - makes it easy to use all data mining methods - adds "time" & "space" - provides a real-time picture - connects easily with other social media (about 30% of users use the same nickname on all platforms)
  6. 6. Data mining - different methods - various measures of semantic distance and similarity (e.g. Jaccard index) - frequency analysis based on time (are people happier in the morning or in the evening?) - frequency analysis based on location - one of the results -> identification of opinion makers in social networks
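The Jaccard index named on the slide above can be sketched in a few lines. This is only an illustration, not the method used in the talk: the tokenizer (lowercased word tokens) and the sample tweets are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase a tweet and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard index of two texts: |A & B| / |A | B| over their token sets."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical sample tweets, not taken from the Sparrow archive.
similarity = jaccard("Madonna plays Prague tonight", "Madonna in Prague tonight")
distance = 1.0 - similarity  # Jaccard distance is the complement of the index
print(similarity, distance)
```

Working on token sets ignores word order and repetition, which is usually acceptable for texts as short as tweets.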
  7. 7. Transmission News: using different APIs to get more information
  8. 8. Transmission News = 5 APIs in one www.transnews.tw • 5x Twitter News Service accounts • 1x Yahoo Geo • 1x Google Search AJAX • 1x Google Maps • 1x Open Calais • and a little bit of Wikipedia
  9. 9. www.transnews.tw
  10. 10. This brings us to the downside of the Twitter API
  11. 11. API searches are limited in the number of queries. Even worse, the data doesn't go back further than about 1.5 weeks
  12. 12. Hence the development of Sparrow 1.0
  13. 13. Czech Twitter by the numbers
  14. 14. Sparrow 1.0 application methodology - archives all tweets located in the Czech Republic at hourly intervals via the Twitter API (starting June 2009) - automatically detects language - identifies Czech tweets with a word-count dictionary - compares Czech Twitter statistics with foreign countries' statistics
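The slide above does not say how the word-count dictionary works. One plausible reading is sketched below: count how many words of a tweet appear in a per-language list of common words and pick the language with the most hits. The word lists, the `min_hits` threshold, and the function name are all hypothetical, not details of Sparrow 1.0.

```python
import re

# Hypothetical, tiny word lists; a real dictionary would be far larger.
DICTIONARIES = {
    "cs": {"a", "je", "se", "na", "že", "to", "v", "jsem", "ale", "tak"},
    "en": {"the", "is", "and", "to", "of", "in", "it", "that", "was", "you"},
}


def detect_language(text, min_hits=2):
    """Return the language whose dictionary matches the most words, or None."""
    words = re.findall(r"\w+", text.lower())
    hits = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in DICTIONARIES.items()
    }
    best = max(hits, key=hits.get)
    # Too few dictionary words means the tweet stays unclassified.
    return best if hits[best] >= min_hits else None


print(detect_language("Dnes jsem byl v Praze a je to tak"))
print(detect_language("The concert is in Prague and it was great"))
print(detect_language("lol"))
```

A threshold such as `min_hits` keeps very short tweets from being labelled on the strength of a single ambiguous word such as "to", which is common in both Czech and English.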
  15. 15. Sparrow 1.0 - June 2009 stats - about 700,000 tweets - created by 10,628 unique users who enabled their geo-location (CZ) or tweeted in Czech - 5,880 users tweeted at least once in Czech - 2,424 Czech-writing users revealed their geo-location (usually about 30% of users do that)
  16. 16. How many Twitter users are in the Czech Republic? Between 6,000 and 8,000 users write in Czech. 1,000 to 2,000 users prefer English. There are about 10,000 active Twitter users in the Czech Republic
  17. 17. What are the dynamics of Czech Twitter? Every four weeks the number of users with at least one tweet rises by about 25%. The number of active users rises 3-5% each week. The absolute number of tweets rises by about 25% too
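To compare the four-week and weekly figures on the slide above, both can be put on the same scale. The sketch below assumes steady compound growth, which the slide does not state; the function names are made up for the illustration.

```python
def equivalent_weekly_rate(four_week_growth):
    """Weekly rate that compounds to the given four-week growth."""
    return (1.0 + four_week_growth) ** (1.0 / 4.0) - 1.0


def compound(weekly_rate, weeks=4):
    """Total growth after compounding a weekly rate over several weeks."""
    return (1.0 + weekly_rate) ** weeks - 1.0


weekly = equivalent_weekly_rate(0.25)        # 25% per four weeks
low, high = compound(0.03), compound(0.05)   # 3-5% per week
print(f"25% per four weeks is about {weekly:.1%} per week")
print(f"3-5% per week is about {low:.1%} to {high:.1%} per four weeks")
```

Under this assumption, 25% per four weeks corresponds to a little under 6% per week, slightly above the 3-5% weekly growth quoted for active users.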
  18. 18. What characteristics do Czech tweets have? 2% are retweets (RT). 4% use a hashtag (#). 21.5% are replies or conversation. 34.6% include a link
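Shares like the ones on the slide above can be computed with simple text rules. The rules below are assumptions for illustration, not the talk's definitions: a retweet contains "RT @user", a reply starts with "@", and a link starts with http:// or https://. The sample tweets and user names are invented.

```python
import re

RETWEET_RE = re.compile(r"\bRT\s+@\w+")   # assumed old-style "RT @user" marker
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
LINK_RE = re.compile(r"https?://\S+")


def classify(tweet):
    """Flag which of the four characteristics a single tweet has."""
    text = tweet.strip()
    return {
        "retweet": bool(RETWEET_RE.search(text)),
        "hashtag": bool(HASHTAG_RE.search(text)),
        "reply": text.startswith("@"),
        "link": bool(LINK_RE.search(text)),
    }


def shares(tweets):
    """Percentage of tweets showing each characteristic."""
    totals = {"retweet": 0, "hashtag": 0, "reply": 0, "link": 0}
    for tweet in tweets:
        for feature, present in classify(tweet).items():
            totals[feature] += int(present)
    n = len(tweets)
    return {k: (100.0 * v / n if n else 0.0) for k, v in totals.items()}


# Hypothetical sample, not taken from the archive.
sample = [
    "RT @webexpo: program is online http://example.com",
    "@josef thanks, see you in Prague",
    "Listening to Rammstein #ostrava",
    "Nothing special today",
]
result = shares(sample)
print(result)
```

The categories are not mutually exclusive: one tweet can be a retweet and contain a link at the same time, so the percentages need not sum to 100.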
  19. 19. What languages do people in the CR use for tweeting?
  20. 20. Let's see that graph [pie chart; legend: English, Czech, Slovak, German, others; slice values as listed in the transcript, not matched to the legend: 13%, 4%, 7%, 44%, 33%]
  21. 21. Geo-location breakdown of tweets among big cities in the Czech Republic (July-August 2009) [pie chart; legend: 9 cities, Prague, others; slice values as listed in the transcript, not matched to the legend: 25%, 30%, 45%]. Top three languages per city, as tweet count and share of the city's tweets:
      1. Praha, 247685 tweets: en 116580 (47.07%), cs 79957 (32.28%), sk 16449 (6.64%)
      2. Brno, 37021 tweets: en 16104 (43.50%), cs 14753 (39.85%), sk 3360 (9.08%)
      3. Ostrava, 23836 tweets: en 13885 (58.25%), cs 5306 (22.26%), pl 1638 (6.87%)
      4. Plzeň, 13681 tweets: en 9160 (66.95%), cs 2206 (16.12%), fr 417 (3.05%)
      5. Olomouc, 10754 tweets: en 4619 (42.95%), cs 3062 (28.47%), pt 999 (9.29%)
      6. Liberec, 14178 tweets: en 9561 (67.44%), cs 2864 (20.20%), sk 462 (3.26%)
      7. České Budějovice, 6219 tweets: cs 2589 (41.63%), en 1386 (22.29%), es 551 (8.86%)
      8. Hradec Králové, 11888 tweets: cs 4696 (39.50%), en 4400 (37.01%), de 1113 (9.36%)
      9. Ústí nad Labem, 12016 tweets: en 4266 (35.50%), de 2882 (23.98%), cs 2570 (21.39%)
      10. Pardubice, 5576 tweets: cs 2718 (48.74%), en 1831 (32.84%), sk 414 (7.42%)
  22. 22. And what about "when"? And why does it matter?
  23. 23. This is what we've learned in a few months: - Czechs tweet most often on Tuesday or Thursday, and least on Saturday. Around the world the most popular day is Tuesday, and the least popular is Sunday - The number of tweets rises steadily from the beginning to the end of the month, then falls and begins rising again. That means people tweet more at the end of the month than at the beginning.
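A day-of-week breakdown like the one on the slide above comes down to counting tweets per weekday. The sketch below assumes each archived tweet has an ISO-8601 timestamp string; the timestamps are invented for the example.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def tweets_per_weekday(timestamps):
    """Count tweets per weekday name from ISO-8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # weekday() returns 0 for Monday, so it indexes DAYS directly.
        counts[DAYS[datetime.fromisoformat(ts).weekday()]] += 1
    return counts


# Hypothetical timestamps, not taken from the archive.
sample = [
    "2009-08-11T09:15:00",
    "2009-08-11T21:40:00",
    "2009-08-13T12:00:00",
    "2009-08-15T18:30:00",
]
counts = tweets_per_weekday(sample)
busiest_day = counts.most_common(1)[0][0]
print(counts, busiest_day)
```

The same counter keyed on `datetime.fromisoformat(ts).day` would give the within-month curve mentioned on the slide.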
  24. 24. Predicting the present: Google vs. Twitter
  25. 25. MADONNA IN PRAGUE 13. 8. 2009
  26. 26. Madonna - August 2009 - Google search
  27. 27. Madonna - August 2009 - Czech Twitter
  28. 28. Sometimes Twitter is quicker & can predict future searches
  29. 29. September 17th, Ostrava
  30. 30. Rammstein - August 2009 - Google search
  31. 31. Rammstein - August 2009 - Czech Twitter 17.9.2009
  32. 32. Thanks for your attention. Questions? Ideas? slerka@ataxo.com