Impact of Sampling on Diffusion on Twitter

Published in: Technology

  1. 1. How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media?
     Munmun De Choudhury¹, Yu-Ru Lin¹, Hari Sundaram¹, K. Selcuk Candan¹, Lexing Xie², Aisling Kelliher¹
     ¹ Arizona State University, Tempe, AZ
     ² IBM T. J. Watson Research Center, Hawthorne, NY
  2. 2. This talk is about sampling the social web
  3. 3. Background and Motivation
     Problem Definition
     Sampling Diffusion Data
     Evaluation of Diffusion Samples
     Experimental Study
     Conclusions
  4. 4. Modern Social Interactional Modes
     Slashdot, Facebook, Engadget, LiveJournal, Digg, Twitter, MetaFilter, Flickr, Reddit, Orkut, Blogger, YouTube, MySpace
     5/23/10
  5. 5. We are attracted to social media, in part due to large-scale datasets
  6. 6. Viral Marketing, Advertising Campaigns
  7. 7. Collaboration, “Wisdom of the Crowds”
  8. 8. Crisis management w.r.t. real-time events
  9. 9. Is there something more fundamental happening here than just scale?
  10. 10. Information Diffusion
  11. 11.
  12. 12. 140 characters can cause revolutions
  13. 13.
  14. 14.
  15. 15. Inference is based on data quality
  16. 16. What has been done?
  17. 17. Snowball
  18. 18. Random walk
  19. 19. Forest Fire [Leskovec et al., KDD 2005]
  20. 20. Designed to capture topology
  21. 21. But not context or content!
  22. 22. Background and Motivation
      Problem Definition
      Sampling Diffusion Data
      Evaluation of Diffusion Samples
      Experimental Study
      Conclusions
  23. 23. Two simple questions
  24. 24. What is the role of context in the sampling process?
  25. 25. What fraction of the social data should we sample?
  26. 26. Twitter
      “Tweets”: shared content up to 140 characters long.
      RT (the re-tweet feature), hashtags (e.g. #iranelection), bit.ly-encoded URLs.
      Follower / following relationship.
      “Trending topics”, e.g. #musicmonday, #formulaone.
      Diffusion via (1) the RT feature, (2) a shared URL (e.g. bit.ly, tinyurl), (3) the same hashtag:
      (1) RT-based diffusion
      (2) URL-based diffusion
      (3) hashtag-based diffusion
  27. 27. Diffusion Series
      [Figure panels: social graph; diffusion series]
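Slide 27 contrasts the social graph with a diffusion series. The slides do not spell out the definition, so the following is only a minimal sketch under one assumption: a diffusion edge links a user who posts on a topic in one time slot to a follower who posts on the same topic in the next slot. The names `build_diffusion_series`, `follows`, `posts`, and `slot_length` are hypothetical.

```python
from collections import defaultdict


def build_diffusion_series(follows, posts, slot_length):
    """Build diffusion edges for a single topic (illustrative assumption only).

    follows: dict mapping a user to the set of users they follow.
    posts: iterable of (user, timestamp) pairs for posts on the topic.
    slot_length: width of a time slot, in the same unit as the timestamps.

    Returns a list of (source, target, slot) tuples, where target posted in
    `slot` and follows source, who posted in the previous slot.
    """
    # Group the users who posted on the topic by time slot.
    slots = defaultdict(set)
    for user, timestamp in posts:
        slots[int(timestamp // slot_length)].add(user)

    edges = []
    for slot in sorted(slots):
        previous = slots.get(slot - 1, set())
        for user in sorted(slots[slot]):
            # Followed users who were active on the topic one slot earlier.
            for source in sorted(follows.get(user, set()) & previous):
                edges.append((source, user, slot))
    return edges


follows = {"b": {"a"}, "c": {"b"}}
posts = [("a", 0), ("b", 60), ("c", 130)]
edges = build_diffusion_series(follows, posts, slot_length=60)
```

Here `a` posts first, `b` (who follows `a`) posts one slot later, and `c` (who follows `b`) posts one slot after that, so the series is a two-edge chain.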
  28. 28. Background and Motivation
      Problem Definition
      Sampling Diffusion Data
      Evaluation of Diffusion Samples
      Experimental Study
      Conclusions
  29. 29. What are our sampling strategies?
  30. 30. Assume we are given N, the number of nodes to pick
  31. 31. And θ, the topic
  32. 32. Plus the social graph G
  33. 33. What if we ignored topology?
  34. 34.
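Ignoring topology means choosing the N users from their attributes alone. A minimal sketch, assuming two illustrative strategies (uniform random choice, and choosing the users most active on topic θ); the function `sample_users` and the `activity` mapping are hypothetical and are not taken from the slides.

```python
import random


def sample_users(activity, n, strategy="random", seed=None):
    """Pick up to n users who posted on the topic, without using the graph.

    activity: dict mapping a user to their number of posts on topic theta.
    strategy: "random" for a uniform sample, "activity" for the most active.
    """
    # Only users with at least one post on the topic are candidates.
    candidates = sorted(u for u, count in activity.items() if count > 0)
    n = min(n, len(candidates))
    if strategy == "random":
        return random.Random(seed).sample(candidates, n)
    if strategy == "activity":
        # Most active first; ties are broken by user name for determinism.
        return sorted(candidates, key=lambda u: (-activity[u], u))[:n]
    raise ValueError(f"unknown strategy: {strategy}")


activity = {"a": 5, "b": 0, "c": 2, "d": 9}
most_active = sample_users(activity, 2, strategy="activity")
random_pick = sample_users(activity, 2, strategy="random", seed=7)
```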
  35. 35. We can also sample topology using Forest Fire
  36. 36.
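Forest Fire sampling grows a sample by "burning" outward from a seed node. The sketch below is a simplified variant, not the exact procedure used in the talk: each burning node spreads to a geometrically distributed number of unvisited neighbors, and a new fire starts at a random node whenever the current one dies out. The function name, the `p_forward` default, and the dict-of-sets graph format are assumptions.

```python
import random
from collections import deque


def forest_fire_sample(graph, n, p_forward=0.7, seed=None):
    """Return a set of up to n nodes chosen by a simplified Forest Fire walk.

    graph: dict mapping a node to the set of its neighbors.
    p_forward: forward-burning probability (assumed default).
    """
    rng = random.Random(seed)
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    target = min(n, len(nodes))
    sampled = set()
    while len(sampled) < target:
        # Start (or restart) a fire at a random node not yet sampled.
        start = rng.choice([v for v in nodes if v not in sampled])
        sampled.add(start)
        frontier = deque([start])
        while frontier and len(sampled) < target:
            current = frontier.popleft()
            unburned = sorted(v for v in graph.get(current, ()) if v not in sampled)
            rng.shuffle(unburned)
            # Number of links to burn: geometric, mean p_forward / (1 - p_forward).
            burn = 0
            while burn < len(unburned) and rng.random() < p_forward:
                burn += 1
            for neighbor in unburned[:burn]:
                if len(sampled) >= target:
                    break
                sampled.add(neighbor)
                frontier.append(neighbor)
    return sampled


graph = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"},
    "d": {"b", "e"}, "e": {"d", "f"}, "f": {"e"},
}
sample = forest_fire_sample(graph, 3, seed=1)
```

Seeding the fire from a chosen user (for example an active one) instead of a random node would combine this with an attribute-based choice of seed.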
  37. 37. Background and Motivation
      Problem Definition
      Sampling Diffusion Data
      Evaluation of Diffusion Samples
      Experimental Study
      Conclusions
  38. 38. Diffusion Saturation Metrics
      User-based (volume, participation, dissemination)
      Topology-based (reach, spread, cascade instances, collection size)
      Time-based (rate)
  39. 39. Distortion
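The slide gives only the term "Distortion". As a hedged illustration, assume it means the relative deviation of a saturation metric measured on a sample from the same metric measured on the unsampled reference data. The function `distortion` and its arguments are hypothetical.

```python
def distortion(sample_value, reference_value):
    """Relative deviation of a metric on the sample from its reference value.

    This definition is an assumption made for illustration; the slide does
    not state the formula.
    """
    if reference_value == 0:
        raise ValueError("reference value must be non-zero")
    return abs(sample_value - reference_value) / abs(reference_value)


# Example: a metric worth 100 on the full data and 80 on the sample.
d = distortion(80, 100)
```

Under this reading, a value of 0 means the sample reproduces the metric exactly, and larger values mean a less faithful sample.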
  40. 40. Diffusion Response Metrics
      Search and news trends, i.e. the ability of the social graph sample to correlate with external temporal variables such as user search behavior and news items featured online (http://news.google.com/), given as:
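The formula from this slide is not preserved in the transcript. Purely as an illustration of what "correlate with external temporal variables" could look like, and not as the authors' metric, the sketch below computes the Pearson correlation between a per-day diffusion volume from a sample and an external trend series. The function `pearson` and both series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily values: tweets on a topic in the sample vs. search interest.
sample_volume = [10, 20, 30, 40]
search_trend = [1, 2, 3, 4]
r = pearson(sample_volume, search_trend)
```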
  41. 41. Background and Motivation
      Problem Definition
      Sampling Diffusion Data
      Evaluation of Diffusion Samples
      Experimental Study
      Conclusions
  42. 42. Reference Set
  43. 43. 500 seed users using mashable.com
  44. 44. Experimental Setup
      ~465K users, ~836K edges (“follower” / “following” relationships) and 29.5M tweets.
      125 randomly chosen “trending topics” from Twitter, between Oct and Nov 2009.
      Trending topic – theme association based on OpenCalais (http://www.opencalais.com/).
      Dataset released for non-commercial research purposes: http://www.public.asu.edu/~mdechoud/temp/released-data/
  45. 45. R1
  46. 46.
  47. 47.
  48. 48. Is this a slam dunk for forest fire + activity?
  49. 49. Looking at themes tells a more nuanced story
  50. 50.
  51. 51.
  52. 52.
  53. 53.
  54. 54. What is a “good” sampling ratio?
  55. 55. What happens when ρ = 0.3?
  56. 56.
  57. 57. Background and Motivation
      Problem Definition
      Sampling Diffusion Data
      Evaluation of Diffusion Samples
      Experimental Study
      Conclusions
  58. 58. Social networks are causing significant changes in our lives
  59. 59. Inferences about social phenomena are affected by data quality
  60. 60. Topic + topology + seed attribute makes a difference to sampling
  61. 61. Thanks! Questions?
      munmun@asu.edu
      For publications / datasets: http://www.public.asu.edu/~mdechoud/
      Twitter: @munmun10
