Impact of Sampling on Diffusion on Twitter

Published in: Technology
  • I believe so – I think social networks will change our understanding of media semantics in a fundamental way and not just because we have more data;
  • Densification, shrinking diameter; generative model
  • Volume, participation and dissemination
  • rate
  • Reach, spread, cascade instances and collection size

    1. 1. How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media?
          Munmun De Choudhury¹, Yu-Ru Lin¹, Hari Sundaram¹, K. Selcuk Candan¹, Lexing Xie², Aisling Kelliher¹
          ¹Arizona State University, Tempe, AZ
          ²IBM T. J. Watson Research Center, Hawthorne, NY
    2. 2. This talk is about sampling the social web
    3. 3. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    4. 4. Modern Social Interactional Modes
          Slashdot, Facebook, Engadget, LiveJournal, Digg, Twitter, MetaFilter, Flickr, Reddit, Orkut, Blogger, YouTube, MySpace
          5/23/10
    5. 5. We are attracted to social media, in part due to large-scale datasets
    6. 6. Viral Marketing, Advertising Campaigns
    7. 7. Collaboration, “Wisdom of the Crowds”
    8. 8. Crisis management w.r.t. real-time events
    9. 9. Is there something more fundamental happening here than just scale?
    10. 10. Information Diffusion
    11. 11.
    12. 12. 140 characters can cause revolutions
    13. 13.
    14. 14.
    15. 15. Inference is based on data quality
    16. 16. What has been done?
    17. 17. Snowball
    18. 18. Random walk
    19. 19. Forest Fire [Leskovec et al., KDD 2005]
    20. 20. Designed to capture topology
    21. 21. But not context or content!
    22. 22. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    23. 23. Two simple questions
    24. 24. What is the role of context in the sampling process?
    25. 25. What fraction of the social data should we sample?
    26. 26. Twitter
          “Tweets”: shared content of up to 140 characters.
          RT (re-tweet feature), hashtags (e.g. #iranelection), bit.ly encoded URLs.
          Follower / Following relationship.
          “Trending topics”, e.g. #musicmonday, #formulaone.
          Diffusion via (1) RT feature, (2) shared URL (e.g. bit.ly, tinyurl), (3) same hashtag.
          (1) RT based diffusion; (2) URL based diffusion; (3) hashtag based diffusion
    27. 27. Diffusion Series
          Social graph; Diffusion series
    28. 28. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    29. 29. What are our sampling strategies?
    30. 30. Assume we are given N, the number of nodes to pick
    31. 31. And θ, the topic
    32. 32. Plus the social graph G
    33. 33. What if we ignored topology?
    34. 34.
    35. 35. We can also sample topology using Forest Fire
    36. 36.
    37. 37. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    38. 38. Diffusion Saturation Metrics
          User-based (Volume, participation, dissemination)
          Topology-based (Reach, spread, cascade instances, collection size)
          Time-based (Rate)
    39. 39. Distortion
    40. 40. Diffusion Response Metrics
          Search and News Trends, i.e. the ability of the social graph sample to correlate with external temporal variables like user search behavior and news items featured online (http://news.google.com/), given as: [formula not preserved in transcript]
    41. 41. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    42. 42. Reference Set
    43. 43. 500 seed users using mashable.com
    44. 44. Experimental Setup
          ~465K users, ~836K edges (“follower” / “following” relationships) and 29.5M tweets.
          125 randomly chosen “trending topics” from Twitter, between Oct and Nov 2009.
          Trending topic – theme association based on OpenCalais (http://www.opencalais.com/).
          Dataset released for non-commercial research purposes: http://www.public.asu.edu/~mdechoud/temp/released-data/
    45. 45. R1
    46. 46.
    47. 47.
    48. 48. Is this a slam dunk for forest fire + activity?
    49. 49. Looking at themes tells a more nuanced story
    50. 50.
    51. 51.
    52. 52.
    53. 53.
    54. 54. What is a “good” sampling ratio?
    55. 55. What happens when ρ = 0.3?
    56. 56.
    57. 57. Background and Motivation; Problem Definition; Sampling Diffusion Data; Evaluation of Diffusion Samples; Experimental Study; Conclusions
    58. 58. Social networks are causing significant changes in our lives
    59. 59. Inferences about social phenomena are affected by data quality
    60. 60. Topic + topology + seed attribute make a difference to sampling
    61. 61. Thanks! Questions?
          munmun@asu.edu
          For Publications / Datasets: http://www.public.asu.edu/~mdechoud/
          Twitter: @munmun10
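
Slide 26 names three channels through which content diffuses on Twitter: the RT feature, a shared URL, and a shared hashtag. As a rough illustration only, the sketch below pulls those three signals out of a tweet's text. The regular expressions and the function name `diffusion_signals` are assumptions made for this example; the slides do not say how the authors parsed tweets.

```python
import re

# Hypothetical patterns; the deck does not specify the authors' parsing rules.
RT_PATTERN = re.compile(r"(?:^|\s)RT\s+@(\w+)")   # "RT @user" convention
URL_PATTERN = re.compile(r"https?://\S+")         # e.g. bit.ly or tinyurl links
HASHTAG_PATTERN = re.compile(r"#(\w+)")           # e.g. #iranelection


def diffusion_signals(tweet: str) -> dict:
    """Return the retweet sources, URLs and hashtags found in one tweet."""
    return {
        "retweet_of": RT_PATTERN.findall(tweet),
        "urls": URL_PATTERN.findall(tweet),
        # Lower-case hashtags so "#IranElection" and "#iranelection" match.
        "hashtags": [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)],
    }


if __name__ == "__main__":
    print(diffusion_signals("RT @alice: protests continue http://bit.ly/abc #IranElection"))
```

Two tweets could then be treated as part of the same diffusion if they share a retweet source, a URL, or a hashtag, which is one way to read the three cases on the slide.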
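
Slides 29 to 33 set up the sampling problem (N nodes to pick, a topic θ, a social graph G) and then ask what happens if topology is ignored. A minimal sketch of that idea, assuming tweets arrive as `(user, topic)` pairs: restrict to users who posted on θ, then pick N of them either uniformly at random or by activity. "Activity" appears on slide 48; the exact selection procedure, the function name `sample_users`, and the input format are assumptions for illustration.

```python
import random
from collections import Counter


def sample_users(tweets, topic, n, strategy="activity", seed=0):
    """Pick up to n users who posted on `topic`, ignoring graph topology.

    tweets: iterable of (user, topic) pairs (a hypothetical input format).
    strategy: "random" for a uniform pick, "activity" for the most active users.
    """
    # Count on-topic tweets per user; only these users are candidates.
    activity = Counter(user for user, t in tweets if t == topic)
    candidates = sorted(activity)  # sorted for reproducible results
    if strategy == "random":
        rng = random.Random(seed)
        return rng.sample(candidates, min(n, len(candidates)))
    if strategy == "activity":
        # Most active first; ties broken by user name.
        ranked = sorted(candidates, key=lambda u: (-activity[u], u))
        return ranked[:n]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    demo = [("a", "x"), ("a", "x"), ("b", "x"), ("c", "y"),
            ("d", "x"), ("d", "x"), ("d", "x")]
    print(sample_users(demo, "x", 2))
```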
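
Slide 35 proposes sampling topology with Forest Fire. The sketch below is a simplified version of the general forest-fire idea, not the authors' implementation: start at a random node, "burn" a geometrically distributed number of its unburned neighbours, recurse from the newly burned nodes, and re-ignite at a fresh random node if the fire dies out before N nodes are collected. The adjacency-dict input, the `p_forward` default, and the function name are assumptions.

```python
import random
from collections import deque


def forest_fire_sample(graph, n, p_forward=0.7, seed=0):
    """Return up to n nodes chosen by a simplified forest-fire process.

    graph: dict mapping each node to a list of neighbours (assumed format).
    p_forward: probability of burning one more neighbour at each step.
    """
    rng = random.Random(seed)
    # Include nodes that appear only as neighbours.
    nodes = sorted(set(graph) | {w for nbrs in graph.values() for w in nbrs})
    target = min(n, len(nodes))
    burned, order = set(), []
    while len(order) < target:
        # (Re-)ignite the fire at a random node that has not burned yet.
        start = rng.choice([v for v in nodes if v not in burned])
        burned.add(start)
        order.append(start)
        frontier = deque([start])
        while frontier and len(order) < target:
            v = frontier.popleft()
            unburned = [w for w in graph.get(v, ()) if w not in burned]
            rng.shuffle(unburned)
            # Geometric draw: keep adding neighbours while the coin succeeds.
            k = 0
            while k < len(unburned) and rng.random() < p_forward:
                k += 1
            for w in unburned[:k]:
                if len(order) >= target:
                    break
                if w in burned:  # guards against duplicate adjacency entries
                    continue
                burned.add(w)
                order.append(w)
                frontier.append(w)
    return order


if __name__ == "__main__":
    ring = {i: [(i + 1) % 10, (i - 1) % 10] for i in range(10)}
    print(forest_fire_sample(ring, 4))
```

A combined strategy such as the "forest fire + activity" mentioned on slide 48 could, under the same assumptions, choose the ignition node by activity instead of uniformly at random.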
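
Slides 39 and 54 to 55 refer to distortion and to a sampling ratio ρ, but the transcript does not preserve either definition. The sketch below assumes the simplest reading: ρ is the fraction of users kept, and distortion is the relative error of a metric measured on the sample against the same metric on the unsampled data. Both definitions and both function names are assumptions, not taken from the slides.

```python
def sample_size(num_users: int, rho: float) -> int:
    """Number of nodes to pick for sampling ratio rho (assumed: N = rho * |V|)."""
    return round(rho * num_users)


def distortion(sample_value: float, reference_value: float) -> float:
    """Relative error of a sample metric against its reference value.

    Assumed definition; the formula on the slide is not in the transcript.
    """
    if reference_value == 0:
        raise ValueError("reference value must be non-zero")
    return abs(sample_value - reference_value) / abs(reference_value)


if __name__ == "__main__":
    # With the ~465K users from slide 44 and rho = 0.3 from slide 55.
    print(sample_size(465_000, 0.3))
    print(distortion(80.0, 100.0))
```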
