Losing My Revolution Long Paper TPDL2012

  1. Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?
     Hany M. SalahEldeen & Michael L. Nelson
     Old Dominion University, Department of Computer Science
  2. All tweets are equal… …but some are more equal than the others
  3. Research Questions:
     • How long would this last?
     • And if lost, is there a backup somewhere?
     • Finally, can we model this existence?
  4. Phase 1: Data Gathering
  5. Data Gathering
     We decided to collect as many posts on social media as possible satisfying these conditions:
     • Has embedded resources.
     • Has a time stamp.
     • From different sources.
     • Related to socially significant events.
  6. Six Socially Significant Events
     • From Twitter, Websites, Books:
       • The Egyptian revolution.
     • From Twitter Only:
       • Stanford’s SNAP dataset:
         • Iranian elections.
         • H1N1 virus outbreak.
         • Michael Jackson’s death.
         • Obama’s Nobel Peace Prize.
       • Twitter API:
         • The Syrian uprising.
  7. Stanford’s SNAP Dataset
     Preparation: we extracted only tweets that
     • are in English,
     • contain embedded resources,
     • contain hashtags.
  8. Twitter Tag Expansion
     • We start with initial tags manually assigned to the event and extract the hashtags that co-occur with them.

     Event: H1N1 Outbreak
     Initial hashtag:
       #h1n1      = 61,351
     Top co-occurring hashtags:
       #swine     = 61,829
       #swineflu  = 56,419
       #flu       = 8,436
       #pandemic  = 6,839
       #influenza = 1,725
       #grippe    = 1,559
       #tamiflu   = 331
       #cnn       = …….
       #health    = …….
  9. Twitter Tag Expansion
     • We repeat this for the other three events from SNAP.
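The tag expansion step can be illustrated with a minimal sketch. It assumes tweets are available as plain strings and that a hashtag is any `#` followed by word characters; the function name `cooccurring_hashtags` and the sample tweets are hypothetical and not taken from the slides.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def cooccurring_hashtags(tweets, seed_tags):
    """Count hashtags that share a tweet with at least one seed hashtag."""
    seeds = {tag.lower() for tag in seed_tags}
    counts = Counter()
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if tags & seeds:
            # Count every non-seed tag once per tweet.
            counts.update(tags - seeds)
    return counts


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "#h1n1 cases rising #swineflu http://example.com/a",
    "Get your shot #H1N1 #swineflu #flu",
    "Unrelated #music post",
    "#h1n1 #flu update",
]
expanded = cooccurring_hashtags(sample_tweets, ["#h1n1"])
```

Sorting `expanded.most_common()` would give a ranked list of candidate tags similar in spirit to the co-occurrence column on the slide.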
  10. Tweet Filtration
     • Using the expanded tags, we sort them by number of tweets and filter the tweets by co-occurrence.

     Event: H1N1 Outbreak
     Hashtags selected for filtration              Tweets extracted
     #h1n1                                                   61,351
     #h1n1 & #swine                                          44,972
     #h1n1 & #swine & #swineflu                              42,574
     #h1n1 & #swine & #swineflu & #pandemic                   5,517

     Final dataset size = 5,517
  11. Tweet Filtration
     • We repeat this for the other three events.
     • We might need further random sampling to reduce the size of the dataset.
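The filtration step, where each added hashtag narrows the set of tweets, might look like the sketch below. It assumes a tweet is kept only if it contains every selected hashtag; the helper names and sample tweets are hypothetical.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def filter_by_tags(tweets, required_tags):
    """Keep only tweets that contain every required hashtag."""
    required = {tag.lower() for tag in required_tags}
    return [
        text
        for text in tweets
        if required <= {tag.lower() for tag in HASHTAG.findall(text)}
    ]


def progressive_filter(tweets, ranked_tags):
    """Return (tags used so far, tweets remaining) after adding each tag."""
    steps = []
    remaining = list(tweets)
    for i in range(1, len(ranked_tags) + 1):
        remaining = filter_by_tags(remaining, ranked_tags[:i])
        steps.append((tuple(ranked_tags[:i]), len(remaining)))
    return steps


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "#h1n1 alert",
    "#h1n1 #swine news",
    "#h1n1 #swine #swineflu report",
    "#swine only",
]
steps = progressive_filter(sample_tweets, ["#h1n1", "#swine", "#swineflu"])
```

Each entry in `steps` corresponds to one row of a table like the one on the filtration slide: the conjunction of tags applied so far and the number of tweets that survive it.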
  12. Egyptian Revolution Dataset
     • Social media played a key role in documenting and driving the revolution.
     • Millions of tweets, Facebook posts, videos, and images were shared during the 18 days of the 25th January 2011 revolution.
     • We manually extracted all the resources we could from the period of 20th January to 1st March.
     • Hard to extract.
  13. Sources Utilized
     • Tweets From Tahrir
     • Storify.com
     • IAmJan25.com
  14. Syrian Uprising Dataset
     • Because this event was ongoing at the time, we used the Twitter search API in the extraction process.
     • As with the SNAP dataset, we applied hashtag expansion and filtration.

     Initial and top co-occurring hashtags:
       #bashar, #risedamascus, #syria, #genocideinsyria, #stopassad2012, #assadcrimes, #assad
  15. What are people sharing?
  16. Data Analysis
     For all the collected data, how many URIs are:
     1. unique and how many are repeated?
     2. still active on the live web and how many died?
     3. archived in one of the public web archives?
  17. Phase 2: Uniqueness and Existence
  18. Uniqueness
     A URL can take many different forms through the numerous URL shorteners.
     http://www.cnn.com could be:
       http://bit.ly/2EEjBl
       http://goo.gl/2ViC

       % curl -I http://goo.gl/2ViC
       HTTP/1.1 301 Moved Permanently
       Content-Type: text/html; charset=UTF-8
       Cache-Control: no-cache, no-store, max-age=0, must-revalidate
       Pragma: no-cache
       Expires: Fri, 01 Jan 1990 00:00:00 GMT
       Date: Tue, 18 Sep 2012 01:08:44 GMT
       Location: http://www.cnn.com/
       Server: GSE
       Transfer-Encoding: chunked
  19. Uniqueness
     • Thus, we resolve all the extracted URLs and keep the final destination URL after redirects (30X redirects).
     • Then we extract all the unique URLs and remove redundancies.
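The resolve-then-deduplicate step can be sketched as follows. To stay runnable without network access, the sketch takes a `head` callable that returns `(status, location)` for a URL; that interface, the `max_hops` limit, the choice to drop URLs that never settle, and the lookup table are all assumptions made for illustration, not the authors' implementation.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve(url, head, max_hops=10):
    """Follow 30X redirects and return the final URL.

    Returns None if a redirect loop is detected or max_hops is exceeded.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        status, location = head(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url
    return None


def unique_destinations(urls, head):
    """Resolve every URL and keep the distinct final destinations."""
    finals = (resolve(url, head) for url in urls)
    return {url for url in finals if url is not None}


# Hypothetical stand-in for real HEAD requests.
FAKE_RESPONSES = {
    "http://goo.gl/2ViC": (301, "http://www.cnn.com/"),
    "http://bit.ly/2EEjBl": (301, "http://www.cnn.com/"),
    "http://www.cnn.com/": (200, None),
    "http://example.com/a": (302, "http://example.com/b"),
    "http://example.com/b": (302, "http://example.com/a"),
}


def fake_head(url):
    return FAKE_RESPONSES[url]


unique = unique_destinations(
    ["http://goo.gl/2ViC", "http://bit.ly/2EEjBl", "http://www.cnn.com/"],
    fake_head,
)
```

In a real crawl, `fake_head` would be replaced by a function that issues a HEAD request, as the `curl -I` example on the previous slide does.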
  20. Uniqueness

     Collection        All Resources   Unique Resources
     Michael Jackson           2,293              1,187
     Iran                      3,429              1,340
     H1N1 Outbreak             5,517              1,645
     Obama                     1,118                370
     Egypt                     7,313              6,154
     Syria                     1,955                355
  21. Existence on the live-web
     • For each unique URL we resolved the final HTTP response and considered 2 classes:
       • Success: 200 OK
       • Failure: the 4XX and 50X families, 30X redirect loops, and soft 404s.
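The two-class decision above can be written as a small function. This is a sketch under assumptions: redirect-loop and soft-404 detection are taken as already done elsewhere and passed in as flags, and any final status other than 200 is treated as a failure, which goes slightly beyond the families the slide lists.

```python
def classify(status, redirect_loop=False, soft_404=False):
    """Classify a resource's final HTTP response as 'success' or 'failure'."""
    # A page can return 200 and still be a soft 404, so check the flags first.
    if redirect_loop or soft_404:
        return "failure"
    if status == 200:
        return "success"
    # Assumption: every other final status (4XX, 50X, and anything else)
    # counts as a failure.
    return "failure"


results = [
    classify(200),
    classify(404),
    classify(503),
    classify(200, soft_404=True),
    classify(301, redirect_loop=True),
]
```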
  22. Existence on the live-web

     Collection        Resources Missing   Percentage Missing
     Michael Jackson                 397               33.45%
     Iran                            339               25.30%
     H1N1 Outbreak                   394               23.95%
     Obama                            92               24.86%
     Egypt                           645               10.48%
     Syria                            25                7.04%
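As a worked check, each percentage missing can be recomputed as missing resources divided by unique resources, using the counts reported on the uniqueness and live-web slides. The variable names are hypothetical; only the numbers come from the slides.

```python
# Unique resources per collection (uniqueness slide).
unique_resources = {
    "Michael Jackson": 1187,
    "Iran": 1340,
    "H1N1 Outbreak": 1645,
    "Obama": 370,
    "Egypt": 6154,
    "Syria": 355,
}

# Resources missing from the live web per collection (live-web slide).
missing_resources = {
    "Michael Jackson": 397,
    "Iran": 339,
    "H1N1 Outbreak": 394,
    "Obama": 92,
    "Egypt": 645,
    "Syria": 25,
}

percent_missing = {
    name: round(100 * missing_resources[name] / unique_resources[name], 2)
    for name in unique_resources
}

for name, pct in percent_missing.items():
    print(f"{name:<16} {pct:6.2f}%")
```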
  23. Existence in Public Web-Archives
     • For each unique URL we downloaded its TimeMap using the Memento aggregator.
     • The aggregator checks more than 10 public web archives for the existence of snapshots.
     • The resource is declared archived if it has at least one memento.
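The "at least one memento" rule can be sketched against a TimeMap in the link-format serialization used by the Memento protocol. The sample TimeMap, the `archive.example` host, and the function names are hypothetical, and the regular expression is a simplification that assumes every link parameter is a quoted `key="value"` pair.

```python
import re

# One link entry: <uri> followed by ;key="value" parameters.
LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
REL = re.compile(r'rel="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URIs whose rel attribute includes the 'memento' token."""
    uris = []
    for uri, params in LINK.findall(timemap_text):
        rel = REL.search(params)
        # rel may hold several tokens, e.g. "first memento".
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


def is_archived(timemap_text):
    """A resource counts as archived if it has at least one memento."""
    return len(memento_uris(timemap_text)) >= 1


# Hypothetical TimeMap, for illustration only.
SAMPLE_TIMEMAP = (
    '<http://www.cnn.com/>; rel="original",\n'
    '<http://archive.example/timemap/link/http://www.cnn.com/>; rel="self"; '
    'type="application/link-format",\n'
    '<http://archive.example/20100101000000/http://www.cnn.com/>; '
    'rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example/20110101000000/http://www.cnn.com/>; '
    'rel="last memento"; datetime="Sat, 01 Jan 2011 00:00:00 GMT"\n'
)

mementos = memento_uris(SAMPLE_TIMEMAP)
```

Splitting `rel` into tokens keeps entries such as `rel="timemap"` or `rel="timegate"` from being miscounted as mementos.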
  24. Existence in Public Web-Archives

     Collection        Resources Archived   Percentage Archived
     Michael Jackson                  406                34.20%
     Iran                             516                38.51%
     H1N1 Outbreak                    693                42.12%
     Obama                            176                47.57%
     Egypt                          1,242                20.18%
     Syria                             19                 5.35%
  25. Phase 3: Existence as a Function of Time
  26. Timeline of Events
     Social Events Having a Bimodal Time Distribution
     List of events: slide 33.
  27. Resources Missing & Archived

     Collection        Percentage Missing   Percentage Archived
     Michael Jackson               36.24%                39.45%
                                   31.62%                30.78%
     Iran                          26.98%                43.08%
                                   24.47%                36.26%
     H1N1 Outbreak                 23.49%                41.65%
                                   25.64%                43.87%
     Obama                         24.59%                47.87%
                                   26.15%                46.15%
     Egypt                         10.48%                20.18%
     Syria                          7.04%                 5.35%
  28. Resources Missing & Archived
  29. Curve Fitting The Data
  30. Conclusions
     • Measured 21,625 resources from 6 data sets in archives & live web.
     • One year after publication, about 11% of content shared on social media will be gone.
     • After that, we are losing roughly 0.02% daily.
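The two figures in the conclusions suggest a simple back-of-the-envelope estimate. The sketch below assumes a straight line that starts at 11% after 365 days and grows by 0.02 percentage points per day; this is only an extrapolation from the numbers on this slide, not the fitted curve from the curve-fitting slide, and the function name is hypothetical.

```python
def estimated_loss_percent(days_since_publication):
    """Rough estimate of the percentage of shared resources lost.

    Assumption: about 11% lost at one year, plus 0.02 percentage points
    for each additional day, as stated in the conclusions.
    """
    if days_since_publication < 365:
        raise ValueError("the sketch only covers ages of one year or more")
    return 11.0 + 0.02 * (days_since_publication - 365)


after_one_year = estimated_loss_percent(365)
after_two_years = estimated_loss_percent(730)
```

Under these assumptions, a second year adds 365 × 0.02 = 7.3 percentage points, so roughly 18.3% would be lost two years after publication.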
  31. Appendix A: Extra Slides
  32. Data Gathering
     Stanford’s SNAP Dataset:
     • Collection of about 50 large network datasets.
     • The Twitter posts dataset comprises nearly half a billion tweets.
     • Posted from June 1st 2009 to December 31st 2009.
     • Nearly 17 million users.
     • Nearly 20-30% of the total posts published on Twitter during this period.
  33. Existence as a function of time
     Dual-Peaked Events:
     • Iranian Elections:
       • 13th Jun. 2009: Protests and elections
       • 1st Aug. 2009: Trials
     • Michael Jackson’s Death:
       • 25th Jun. 2009: Death announcement
       • 10th Jul. 2009: Death unnatural causes
     • H1N1 Outbreak:
       • 11th Sept. 2009: Worldwide outbreak
       • 5th Oct. 2009: Vaccine release
     • Obama’s Nobel Peace Prize:
       • 9th Oct. 2009: Prize announcement
       • 10th Dec. 2009: Nobel Ceremony
  34. Future Work
     In the next steps we will:
     • expand the datasets.
     • cover the time periods not yet sampled, in 2010 and before 2009.
     • examine the extended points closely and tune the function with time.
     • analyze other factors such as publishing venue, rate of sharing, popularity of authors, and the nature of the event.
