Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media


  1. 1. Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media 1 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  2. 2. Alexander C. Nwala. Supervisor: Michael L. Nelson; Co-supervisor: Michele C. Weigle. Old Dominion University, Web Science & Digital Libraries Research Group (@acnwala • @WebSciDL). Joint Conference on Digital Libraries (JCDL) Doctoral Consortium, June 3, 2018, Fort Worth, TX. This work was made possible in part by IMLS LG-71-15-0077-15. Thank you SIGIR for the Travel Grant.
  3. 3. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  4. 4. In March 2014, there was a serious outbreak of Ebola in West Africa. The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths [1]. http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog [1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
  5. 5. A few months after the Ebola outbreak, an archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak. http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
  6. 6. 6 Archive-It Ebola virus seeds http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/ http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html http://www.cdc.gov/mmwr/ebola_reports.html http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/ http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/ http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/ http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/ http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html 
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1 http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/ http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105 http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105 http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/ http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/ http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/ http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/ Sample of Archive-It Ebola virus seeds
  7. 7. Archived web collections begin with seeds ● A seed list is an initial collection of exemplar web pages for a topic ○ seeds + linked pages form a collection when crawled ● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.” ● Human-generated seeds are high-quality, but expensive to generate
  8. 8. Archived web collections offer a way of preserving the historic record of important events. Examples: Mandela’s legacy (http://xhosaculture.co.za/), 2016 Dakota Access Pipeline (https://www.wsj.com/), 2018 Winter Olympics (http://www.nj.com/)
  9. 9. Seeds may be generated by multiple users ● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for:
  10. 10. Sample seeds contributed for the Boston Marathon Bombing Collection
  11. 11. Users on social media share stories that include hand-selected URIs ● The Wikipedia page about the Stoneman Douglas High School shooting was created the same day as the shooting event (February 14, 2018) ● We consider Wikipedia references an example of a micro-collection. We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds.
  12. 12. More micro-collections: extract URIs from Twitter Moments to generate seeds ● The Stoneman Douglas High School shooting Twitter Moment was created the day after the event
  13. 13. Micro-collections often start early, before major events: the Storify story “Protests In Kiev Turn Violent” was published in January 2014, before the major event, the Russian annexation of Crimea (started late February 2014).
  14. 14. Micro-collections may include prelusive events lacking in collections triggered by major events: the Storify story of the Ukrainian crisis (January 2014) highlights riots before the Russian annexation of Crimea (late February). The Archive-It collection for the event potentially omits some of the prelusive content in the Storify micro-collection.
  15. 15. We propose extracting URIs from social media micro-collections to bootstrap archived collections or augment curator-selected seeds.
  16. 16. Sample of seed URIs for the Ebola virus topic, drawn from Archive-It seeds and from micro-collections (Reddit SERP and comments, Wikipedia references): http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html https://twiter.com/ebola_response/ https://www.who.int/mediacentre/news/ebola/en/ http://allafrica.com/stories/201407310957.html http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html http://jid.oxfordjournals.org/content/204/suppl_3/S785_long http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804 https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106 http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest http://www.who.int/csr/disease/ebola/en/ http://www.nature.com/articles/nature10348
  17. 17. Taking the effort to create a micro-collection is an indication of editorial effort, and thus presumably of the quality of its seeds. Figure: Wikipedia references for the Stoneman Douglas High School shooting
  18. 18. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  19. 19. Primary research questions ● Research question 1: ○ Are seeds that are generated automatically from micro-collections in social media comparable to curator-generated seeds? ○ What quantitative method(s) can be used to compare collections? ● Research question 2: ○ If we consider curator hand-selected seeds the gold standard for collections, could this lead to the definition of what makes a collection good? ○ How do we assess the quality of collections at scale?
  20. 20. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  21. 21. Generating seed URIs from social media ● We implemented a prototype system for generating seeds from the following social media sites: ○ Storify (out of service since May 16, 2018), ○ Twitter Moments, ○ Reddit, and ○ Wikipedia ● We also generated seeds from the Google SERP as a baseline against which to compare social media micro-collections, since we believe SERPs are a primary source for discovering seeds.
  22. 22. Social media micro-collections were similar to Archive-It seeds. Euclidean distance range between collections: 0.17 to
  23. 23. Generating seeds from Storify ● Storify was a social media curation service that enabled users to create stories consisting of hand-selected web resources such as: ○ URIs of news articles, images, videos, etc. ○ Seeds = URIs in Storify stories ○ Unfortunately, Storify went out of service in May 2018 ○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
  24. 24. Generating seeds from Twitter Moments ● Twitter Moments is a service by Twitter that lets users create topical collections of tweets. ● Tweets in Twitter Moments embed ○ URIs of news articles, images, videos, etc. ○ Seeds = URIs in Twitter Moments
  25. 25. Generating seeds from Reddit ● Reddit is a service that allows users to post URIs for various topics. ● Reddit users rate the URIs and post comments that may also include URIs ○ Seeds = URIs in Reddit pages and comments
  26. 26. Generating seeds from Wikipedia ● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology ● The references of Wikipedia documents include URIs relevant to the document topic ○ Seeds = URIs in Wikipedia references. Figure: Wikipedia references for the Stoneman Douglas High School shooting
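The "Seeds = URIs in ..." rule used for each source above can be sketched as a small link extractor. This is only an illustration, not the prototype system described in the slides: the names `LinkExtractor` and `extract_seeds` and the sample HTML are hypothetical, and a real micro-collection page would need source-specific handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only absolute web links; relative/internal links are skipped.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_seeds(html, exclude_hosts=()):
    """Return de-duplicated candidate seed URIs in first-seen order."""
    parser = LinkExtractor()
    parser.feed(html)
    seeds, seen = [], set()
    for uri in parser.links:
        host = urlparse(uri).hostname or ""
        if host in exclude_hosts or uri in seen:
            continue
        seen.add(uri)
        seeds.append(uri)
    return seeds


# Hypothetical snippet standing in for the reference list of a micro-collection page.
sample_html = """
<ol class="references">
  <li><a href="https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html">CDC</a></li>
  <li><a href="http://www.who.int/csr/disease/ebola/en/">WHO</a></li>
  <li><a href="/wiki/Ebola">internal link</a></li>
  <li><a href="http://www.who.int/csr/disease/ebola/en/">WHO (repeat)</a></li>
</ol>
"""
seeds = extract_seeds(sample_html)
print(seeds)
```

The `exclude_hosts` parameter is one assumed way to drop links that point back to the hosting platform itself.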
  27. 27. Not all micro-collections yield high quality seeds: how do we recognize low quality seeds at scale? Examples: spam links in tweets; hijacked hashtag. Can we assess authority of source? (Example shown: Infowars.)
  28. 28. Seeds generated from SERPs can be used as a baseline against which to compare social media micro-collections. Example: potential seeds generated from the Google SERP for the query “hurricane harvey”
  29. 29. The probability of finding the URI of a news story diminishes with time ● Daily probability of finding the URI of the same story: 0.34 - 0.44, weekly: 0.01 - 0.11, and monthly: 0.01 - 0.08
  30. 30. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  31. 31. Characterizing or comparing collections is a challenging task ● It requires comparing collections that may cater to different needs ● We explored foundational work in collection characterization from Library and Web Sciences ○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS) ● The CCS is used to describe individual collections and compare multiple collections
  32. 32. CCS: NLM Ebola virus collection example
      1. Distribution of topics: a ranked list of topics in a collection a. “ebola outbreak west africa” b. “guinea liberia sierra leone” c. “cases ebola virus disease” d. “public health workers” e. “centers disease control prevention”
      2. Distribution of sources (hostnames): a statistical summary of the various sources sampled in order to build the collection: a. 18 (12.5%) web pages from blogs.plos.org, b. 14 (9.7%) from cdc.gov, and c. 11 (7.6%) from twitter.com (Top 10 hosts fraction of collection: 50%)
      3. Temporal distribution - Publication & Content: the set of dates in a collection, e.g., “From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end” Page last updated: December 27, 2017
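The distribution of sources above (counts and percentages per hostname, plus the share held by the top hosts) can be sketched as follows. This is a minimal illustration under assumptions: the function name `source_distribution` and the sample URIs are hypothetical, and hostnames are taken as-is from the URI (so `www.cdc.gov` is not collapsed to `cdc.gov` as on the slide).

```python
from collections import Counter
from urllib.parse import urlparse


def source_distribution(uris, top_k=10):
    """Return (ranked, top_fraction).

    ranked: list of (hostname, count, fraction of collection), most common first.
    top_fraction: share of the collection contributed by the top_k hostnames.
    """
    hosts = [(urlparse(u).hostname or "").lower() for u in uris]
    total = len(hosts)
    if total == 0:
        return [], 0.0
    counts = Counter(hosts)
    ranked = [(h, c, c / total) for h, c in counts.most_common()]
    top_fraction = sum(c for _, c, _ in ranked[:top_k]) / total
    return ranked, top_fraction


# Hypothetical four-page collection used only to exercise the function.
sample_uris = [
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "https://twitter.com/ebola_response/",
]
ranked, top2 = source_distribution(sample_uris, top_k=2)
for host, count, fraction in ranked:
    print(f"{count} ({fraction:.1%}) from {host}")
```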
  33. 33. Quantifying textual diversity in a collection 4. Content diversity: a value between 0 and 1 indicating the degree of self-similarity of the text content of the collection ○ 0 - no diversity; duplicate documents ○ 1 - maximum diversity; documents without any common vocabulary
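One way to obtain a score with the two endpoints described above is to subtract the mean pairwise similarity of the documents from 1. The sketch below assumes Jaccard similarity over lowercase word sets; the slides do not state the exact formula used in the CCS, so the scores it produces should not be expected to match the values reported in the talk.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase word set of a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def content_diversity(docs):
    """1 - mean pairwise similarity: 0 for duplicates, 1 for no shared vocabulary."""
    sets = [tokens(d) for d in docs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0  # assumption: a single document has no diversity
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


duplicates = ["Harvey Puts Houston Underwater"] * 3
unrelated = [
    "Donald Trump Congratulates Roy Moore for Primary Win",
    "Harvey Puts Houston Underwater",
    "Mass Shooting in Las Vegas",
]
print(content_diversity(duplicates), content_diversity(unrelated))
```

A production version would likely work on full page text with stop-word removal or TF-IDF weighting; titles are used here only to keep the example short.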
  34. 34. Content diversity example (colors = collections, numbers = stories)
      ID  News title                                               Collection
      1   “Donald Trump Congratulates Roy Moore for Primary Win”   Roy Moore Wins
      2   “Trump offers congratulations to Roy Moore”              Roy Moore Wins
      3   “Roy Moore wins Alabama Senate GOP primary runoff”       Roy Moore Wins
      4   “Harvey Puts Houston Underwater”                         Hurricane Harvey
      5   “Hurricane Harvey intensifies to Category 2 storm”       Hurricane Harvey
      6   “Harvey Puts Houston Underwater”                         Hurricane Harvey
      7   “Mass Shooting in Las Vegas”                             Vegas Shooting
      8   “Mass Shooting Outside Las Vegas’ Mandalay Bay”          Vegas Shooting
      9   “Las Vegas shooting: What we know”                       Vegas Shooting
      [Figure: diversity scores for collections and story groupings; values shown: 0.39, 0.58, 0.30, 0.00, 0.00, 0.00, 1.00, 1.00, 0.75]
  35. 35. 5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a collection samples a single source, a handful of sources, or many sources. There are multiple ways of measuring URL diversity: http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
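As one possible reading of source diversity, the sketch below scores a list of items as (unique - 1) / (total - 1), which gives 0 when every item comes from a single source and 1 when all items are distinct. This formula is an assumption chosen for illustration; the slide only says that multiple measures exist. Domain-level diversity is left out because extracting registered domains correctly needs a public suffix list.

```python
from urllib.parse import urlparse


def diversity(items):
    """(unique - 1) / (total - 1): 0 = one repeated source, 1 = all distinct."""
    n = len(items)
    if n <= 1:
        return 0.0  # assumption: diversity is undefined for 0 or 1 items, report 0
    return (len(set(items)) - 1) / (n - 1)


def source_diversity(uris):
    """URI-level and hostname-level diversity for a collection of URIs."""
    hosts = [(urlparse(u).hostname or "").lower() for u in uris]
    return {"uri": diversity(list(uris)), "hostname": diversity(hosts)}


# Four distinct URIs that come from only two hostnames.
sample = [
    "http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html",
    "http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html",
    "http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html",
    "http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html",
]
result = source_diversity(sample)
print(result)
```

In this sample the URI diversity is maximal while the hostname diversity is low, which is the single-source versus many-sources distinction the measure is meant to expose.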
  36. 36. CCS: Approximating popularity and target audience
      6. Collection exposure - Archival rate and Tweet index rate: approximates popularity ● Archival rate: fraction of archived URIs in collection ● Tweet index rate: fraction of URIs in collection found embedded in tweets
      7. Target audience: approximates the target audience of a collection with readability scores (grade level - title - source):
         7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
         11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
         12th - “Harvey Puts Houston Underwater” - dailycaller.com
         18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov
         20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
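Both exposure measures are fractions over the URIs in a collection, so they can share one helper. In the sketch below the sets `archived` and `tweeted` are hypothetical stand-ins: in practice they would come from querying web archives and Twitter, which is not shown here.

```python
def fraction_in(uris, known):
    """Fraction of the collection's URIs that appear in the set `known`."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(1 for u in uris if u in known) / len(uris)


# Hypothetical collection and lookup results, for illustration only.
collection = [
    "http://example.org/story-1",
    "http://example.org/story-2",
    "http://example.org/story-3",
    "http://example.org/story-4",
]
archived = {"http://example.org/story-1", "http://example.org/story-2", "http://example.org/story-3"}
tweeted = {"http://example.org/story-2"}

archival_rate = fraction_in(collection, archived)
tweet_index_rate = fraction_in(collection, tweeted)
print(archival_rate, tweet_index_rate)
```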
  37. 37. Comparing collections with CCS ● We represented each collection as an n-dimensional vector of CCS values. ● We calculated the distance between vectors.
      CCS metric                                Archive-It Col.   Reddit Col.
      Doc-Term content diversity                0.86              0.89
      List of entity set content diversity      0.65              0.85
      URI diversity                             1.00              0.98
      Domain diversity                          0.34              0.50
      Hostname diversity                        0.43              0.53
      Social media rate                         0.07              0.12
      Archival rate                             0.99              0.78
      Tweet index rate                          0.72              0.40
      Exposure rate (reading level)             0.61              0.61
      n-gram similarity of topic distribution   1.00              0.70
      Normalized Euclidean distance: 0.17
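The vector comparison above can be reproduced approximately from the table. The sketch assumes that "normalized" means dividing the squared distance by the number of dimensions before taking the square root; the slide does not define the normalization. Under that assumption the two columns are about 0.18 apart, close to the 0.17 reported on the slide.

```python
import math

# CCS values copied from the table, in row order.
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]


def normalized_euclidean(a, b):
    """sqrt(mean squared difference): stays in [0, 1] when every value is in [0, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared / len(a))


distance = normalized_euclidean(archive_it, reddit)
print(round(distance, 3))
```

Dividing by the number of dimensions keeps distances comparable when the number of CCS measures changes, which is the reason for choosing this normalization in the sketch.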
  38. 38. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  39. 39. Summary of completed research (2016): JCDL 2016: showed the need for using local news sources to build collections for local events.
  40. 40. Summary of completed research (2017-2018): HyperText 2018: Introduced the Collection Characterizing Suite for characterizing and comparing collections
  41. 41. Summary of completed research (2017-2018): JCDL 2018: Investigated discoverability of URIs of news stories on SERPs
  42. 42. Outline of work that informed this research ● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster) ● Interacting with Twitter a. Extracting tweet conversations b. Finding URLs on Twitter ● Extracting text from news documents ● Finding Storify stories
  43. 43. Outline 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  44. 44. Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
      2018-06 - 2018-12: Identify hubs & authorities in social media
      2018-06 - 2018-12: Candidacy proposal
      2019-01 - 2019-03: Implement seed generation system
      2019-04 - 2019-08: Evaluate seed generation system
      2019-09 - 2019-12: Dissertation/Defense
  45. 45. Conclusions ● Archived collections offer a way of preserving the historic record of important events, and they begin with seeds. ● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events. ○ Introduced the CCS for characterizing and comparing collections ○ We showed that micro-collections generated from social media are similar to Archive-It seeds ● Primary research tasks remaining: ○ Identify hubs and authorities in social media as a method to evaluate quality at scale ○ Investigate what makes “good” seeds and implement/evaluate the seed generation system. @acnwala @webscidl Thank you!