2015 hypertext-election prediction

ACM Hypertext 2015 presentation. The full paper can be found here: http://dl.acm.org/citation.cfm?id=2791033

Published in: Social Media

  1. 1. Twitter-based election prediction in the developing world. Nugroho Dwi Prasetyo & Claudia Hauff
  2. 2. The what & why. Twitter-based election polling is a cheap alternative to traditional “offline” polls. Twitter-based election polling should achieve a prediction accuracy similar to traditional polls. Diagram labels: millions of potential voters; inferred votes; biases.
  3. 3. “No, you cannot predict elections with Twitter.” D. Gayo-Avello, IEEE Internet Computing 16(6), 2012, pp. 91-94. (Photo: @flickr:misteraitch)
  4. 4. That hasn’t stopped people from trying!
  5. 5. (Photo: @flickr:practicalowl)
     Country | Election | Method | Period | Candidates/parties | Keywords | Reported error
     Germany | Federal | Count tweets & hashtags | 5 weeks | 6 | party names | 1.7%
     Singapore | Presidential | Count tweets + sentiment | 1 week | 4 | candidate names | 6.1%
     USA | Presidential | Count tweets + sentiment | 6 months | 2 | candidate names | 11.6%
     Ireland | General | Count tweets + sentiment | 3 weeks | 5 | party names + election hashtag | 3-6%
     Netherlands | Senate | Count tweets | 1 month | 12 | Dutch words | 1.3%
     USA | Presidential | Count tweets | 6 weeks | 2 | N/A | 1.7%
     Germany | Federal | Count hashtags + sentiment | 4 months | 6 | party names + election hashtags | N/A
     USA, France | Presidential | Sentiment | 2 months | 2 | candidate names + election hashtag | N/A
     USA | Republican nomination | Count tweets + sentiment | 1 year | 7 | candidate names | N/A
     Venezuela, Paraguay, Ecuador | Presidential | Count tweets + users | 7 months | 2, 3, 2 | candidate names and aliases | 0.1%-19%
  6. 6. So far … Twitter-based predictions lag behind traditional polls. Most works focus on elections in the developed world. Traditional polls are accurate. Traditional polls are conducted often.
  7. 7. So far … Twitter-based predictions lag behind traditional polls. Most works focus on elections in the developed world. What do Twitter-based methods add?
  8. 8. In the developing world … traditional polls are less likely to be reliable. … the demographic bias of Twitter users is high. Mean absolute error of the 20 traditional polls conducted in the run-up to the 2014 Indonesian presidential election: 4.08%, 3.45%, 11.75%, 4.21%, 12.24%, 5.64%, 6.25%, 1.36%, 2.69%, 1.19%, 7.02%, 4.20%, 8.84%, 0.98%, 3.96%, 3.13%, 4.24%, 1.15%, 0.87%, 11.49%.
  9. 9. Our contributions. A detailed analysis of all major factors of Twitter-based election forecasting with a special emphasis on de-biasing through “offline” data. An in-depth comparison of 20 traditional polls and Twitter-based forecasts for the 2014 Indonesian presidential election. (Photo: @flickr:carbonnyc)
  10. 10. Approach
  11. 11. Processing pipeline
     (1) Data collection: election type, data access, duration, keywords
     (2) Data filtering: spam, organisations, geo-location
     (3) Data de-biasing: age, gender, location
     (4) Election prediction: candidate mentions, one vote per user, tweet sentiment
  12. 12. The ground truth: election outcome & traditional polls. MAE = (sum over candidates of |predicted vote % - election vote %|) / #candidates
  13. 13. Use case & data @flickr:rh2ox
  14. 14. 2014 Indonesian presidential election (July 9, 2014). Joko Widodo vs. Prabowo Subianto. Widodo won 53.15% of the votes. Widodo won in 23 of the 33 provinces. Widodo was supported by the opposition.
  15. 15. Gathered tweets (POLLDATA). Manually curated keyword list (updated daily); only tweets geo-located in Indonesia are included.
     Crawling period | #Electoral tweets | Max. tweets / day | #Users | Max. active users / day
     April 15 - July 8, 2014 | 7,020,228 | 375,064 | 490,270 | 148,135
  16. 16. Gathered tweets II (USERDATA). Most recent 100 tweets per user. Not used for prediction purposes.
     Crawling period | #Tweets | #Users
     July 25 - 30, 2014 | ~42,000,000 | 490,270
  17. 17. Insights into data @flickr:edith_soto
  18. 18. Is spam a problem? 7.4% are spam users, 2.1% are “slacktivists”, and 3.8% are non-personal users. Based on a manual classification of 600 randomly selected users in USERDATA.
  19. 19. How large is the bias? Based on a manual classification of 600 randomly selected users in USERDATA. [Bar charts, 0%-80%, Twitter vs. population: gender (female, male) and age (0-19, 20-49, 50+).]
  20. 20. How large is the bias? Automatic classification of POLLDATA. [Bar charts, 0%-80%, Twitter vs. population: gender (female, male) and age (0-19, 20-49, 50+).]
  21. 21. How large is the bias? Location: based on reverse geocoding & population data for Indonesia. [Map labels: Jakarta; Internet penetration rate: 17%.]
  22. 22. Results @flickr:nathanmac87
  23. 23. From tweets to users (our baselines)
     Method | Widodo | Subianto | MAE | Traditional polls | Province level correct | Min. MAE | Max. MAE
     tweet count | 56.45% | 43.55% | 3.3% | +7 / -13 | 23/33 | 0.27 | 26.09
     user count | 54.45% | 45.55% | 1.3% | +4 / -16 | 24/33 | 0.05 | 25.01
     On the national level, “one user one vote” outperforms tweet-based predictions (confirming prior works). On the province level the changes are minuscule.
  24. 24. Keyword selection. [Chart comparing keyword sets: all keywords, candidate name, 5 keywords.] Simply using more keywords does not always lead to better results.
  25. 25. Location de-biasing
     Method | Widodo | Subianto | MAE | Traditional polls
     tweet count | 55.14% | 44.86% | 2.0% | +5 / -15
     user count | 54.26% | 45.74% | 1.1% | +2 / -18
     Decreasing the influence of tweets from overrepresented locations in the dataset improves the prediction.
  26. 26. Gender de-biasing
     Method | Widodo | Subianto | MAE | Traditional polls | Province level correct | Min. MAE | Max. MAE
     tweet count | 56.36% | 43.64% | 3.2% | +7 / -13 | 21/33 | 0.33 | 28.05
     user count | 54.89% | 45.11% | 1.7% | +5 / -15 | 23/33 | 0.10 | 26.72
     Correcting for gender biases degrades the prediction accuracy on the national & province level.
  27. 27. Impact of sentiment (rows in slide order)
     Sentiment | Method | Widodo | Subianto | MAE | Traditional polls | Province level correct | Min. MAE | Max. MAE
     POS | tweet count | 53.98% | 46.02% | 0.8% | +0 / -20 | 14/33 | 0.01 | 54.90
     POS | user count | 54.02% | 45.98% | 0.9% | +0 / -20 | 19/33 | 0.26 | 26.51
     POS+NEG | tweet count | 50.67% | 49.33% | 2.5% | +5 / -15 | 14/33 | 0.01 | 49.79
     POS+NEG | user count | 53.77% | 46.23% | 0.6% | +0 / -20 | 19/33 | 0.01 | 26.40
     On the national level, sentiment yields the best forecast. The impact on the province level prediction is negative.
  28. 28. Impact of sentiment (continued). More than 700 languages are spoken in Indonesia.
  29. 29. Conclusions. Simple Twitter-based predictors outperform (almost) all traditional polls in Indonesia. Accurate predictions at the province level are challenging, due to data sparsity & data diversity. Currently: designing a Web application prototype to automatically observe ongoing elections.
  30. 30. Thank you. c.hauff@tudelft.nl
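
The evaluation figures on slides 12 and 23 can be reproduced in a few lines of Python. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`vote_shares`, `mean_absolute_error`, `polls_better_than`) and the toy mention list are hypothetical, Subianto's official share is taken to be the remainder of Widodo's 53.15%, ties within a user are dropped, and the reading of the "+k" column as "number of traditional polls with a lower MAE than the forecast" is an inference that matches the poll errors listed on slide 8.

```python
from collections import Counter

# Official result (slide 14); Subianto's share is assumed to be the remainder.
ACTUAL = {"Widodo": 53.15, "Subianto": 100 - 53.15}

# MAE of the 20 traditional polls, as listed on slide 8 (percentage points).
POLL_MAES = [4.08, 3.45, 11.75, 4.21, 12.24, 5.64, 6.25, 1.36, 2.69, 1.19,
             7.02, 4.20, 8.84, 0.98, 3.96, 3.13, 4.24, 1.15, 0.87, 11.49]


def vote_shares(mentions, one_vote_per_user=False):
    """Turn (user, candidate) mentions into vote percentages.

    With one_vote_per_user=True, each user casts a single vote for the
    candidate they mention most; users with a tie cast no vote
    (an assumption of this sketch).
    """
    if one_vote_per_user:
        per_user = {}
        for user, candidate in mentions:
            per_user.setdefault(user, Counter())[candidate] += 1
        votes = Counter()
        for counts in per_user.values():
            ranked = counts.most_common(2)
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                votes[ranked[0][0]] += 1
    else:
        votes = Counter(candidate for _, candidate in mentions)
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no votes to count")
    return {c: 100 * n / total for c, n in votes.items()}


def mean_absolute_error(predicted, actual):
    """Average over candidates of |predicted vote % - election vote %|."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


def polls_better_than(mae, poll_maes=POLL_MAES):
    """Count the traditional polls whose MAE is lower than the given MAE."""
    return sum(1 for p in poll_maes if p < mae)


# Toy data (hypothetical): one very active user versus two quiet ones.
mentions = [("u1", "Widodo"), ("u1", "Widodo"), ("u1", "Widodo"),
            ("u2", "Subianto"), ("u3", "Subianto")]
print(vote_shares(mentions))                          # tweet count
print(vote_shares(mentions, one_vote_per_user=True))  # user count

# Tweet-count baseline from slide 23.
baseline = {"Widodo": 56.45, "Subianto": 43.55}
mae = mean_absolute_error(baseline, ACTUAL)
print(round(mae, 1), polls_better_than(mae))          # 3.3 7
```

In the toy data, counting tweets favours the candidate of the single prolific user, while counting users does not. This is the kind of effect the "one user one vote" setting is meant to dampen.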
