Bootstrapping Web Archive Collections
of Stories from Micro-collections in Social Media
@acnwala • JCDL 2018 • June 3, 2018
Alexander C. Nwala
Supervisor: Michael L. Nelson and Co-supervisor: Michele C. Weigle
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL) Doctoral Consortium
June 3, 2018, Fort Worth, TX
This work was made possible in
part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
In March 2014, there was a serious outbreak of Ebola in West Africa.
The outbreak severely affected Guinea, Liberia, and Sierra Leone, with
about 11,000 deaths [1].
[1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
A few months after the Ebola outbreak began, an archivist at the National Library of
Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak.
Archive-It Ebola virus seeds
http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
http://www.cdc.gov/mmwr/ebola_reports.html
http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
Sample of Archive-It
Ebola virus seeds
Archived web collections begin with seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a
common topic, e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality, but expensive to generate
Archived web collections offer a way of preserving the
historic record of important events
● Mandela’s legacy: http://xhosaculture.co.za/
● 2016 Dakota Access Pipeline: https://www.wsj.com/
● 2018 Winter Olympics: http://www.nj.com/
Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have
on multiple occasions requested that users submit seeds via Google Docs
for:
Sample seeds contributed for the Boston Marathon Bombing Collection
Users on social media share stories that include hand-selected URIs
● The Wikipedia page about the Stoneman Douglas High School shooting
was created the same day as the shooting event (February 14, 2018)
● We consider Wikipedia references an example of a micro-collection
We propose extracting URIs from micro-collections such as Wikipedia
references to generate seeds
More micro-collections: extract URIs from Twitter Moments to generate seeds:
● Stoneman Douglas High School shooting Twitter Moment created the day
after the event
Micro-collections often start early before major events
● Storify story published Jan 2014: “Protests In Kiev Turn Violent,”
before the major event: Russian annexation of Crimea
(started late February 2014)
Micro-collections may include prelusive events lacking in
collections triggered by major events
● Storify story of the Ukrainian crisis event (January 2014) highlights riots
before Russian annexation of Crimea (late February).
● Archive-It collection for the event potentially omits some of the
prelusive contents in the Storify micro-collection
We propose extracting URIs from social media
micro-collections to bootstrap archived collections
or augment curator-selected seeds.
Sample of seed URIs for Ebola virus topic
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html
https://twiter.com/ebola_response/
https://www.who.int/mediacentre/news/ebola/en/
http://allafrica.com/stories/201407310957.html
http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html
http://jid.oxfordjournals.org/content/204/suppl_3/S785_long
http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804
https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106
http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest
http://www.who.int/csr/disease/ebola/en/
http://www.nature.com/articles/nature10348
Legend: Reddit SERP and comments; Archive-It seeds; Wikipedia references; Micro-collections
The effort taken to create a micro-collection indicates editorial care,
and thus presumably the quality of its seeds.
Example: Wikipedia references for Stoneman Douglas High School Shooting
Primary research questions
● Research question 1:
○ Are seeds that are generated automatically from micro-collections in
social media comparable to curator-generated seeds?
○ What quantitative method(s) can be used to compare collections?
● Research question 2:
○ If we consider curator hand-selected seeds the gold standard for
collections, could this lead to the definition of what makes a collection
good?
○ How do we assess the quality of collections at scale?
Generating seed URIs from social media
● We implemented a prototype system for generating seeds from the
following social media sites:
○ Storify (out of service since May 16, 2018),
○ Twitter Moments,
○ Reddit, and
○ Wikipedia
● We also generated seeds from the Google SERP as a baseline for comparing
social media micro-collections, since we believe SERPs are a primary
source for discovering seeds.
Social media micro-collections were
similar to Archive-It seeds
Euclidean distance range between collections: 0.17 to
Generating seeds from Storify
● Storify was a social media curation service that enabled users to create
stories consisting of hand-selected web resources such as:
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Storify stories
○ Unfortunately, Storify went out of service in May 2018
○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
Generating seeds from Twitter Moments
● Twitter Moments is a service by Twitter that lets users create topical
collections of tweets.
● Tweets in Twitter Moments embed
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Twitter Moments
Generating seeds from Reddit
● Reddit is a service that allows users to post URIs for various topics.
● Reddit users rate the URIs and post comments that may also include URIs
○ Seeds = URIs in Reddit pages and comments
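A minimal sketch of pulling candidate seed URIs out of comment text, assuming the comments have already been fetched as plain strings. The helper name `extract_uris`, the regular expression, and the sample comments are hypothetical illustrations, not the prototype system's code.

```python
import re

# Assumption: a deliberately simple pattern for http(s) URIs in free text.
# It stops at whitespace, angle brackets, double quotes, and closing brackets.
URI_PATTERN = re.compile(r'https?://[^\s<>"\)\]]+')


def extract_uris(texts):
    """Return the URIs found in the given strings, in first-seen order, without duplicates."""
    seen = set()
    seeds = []
    for text in texts:
        for uri in URI_PATTERN.findall(text):
            uri = uri.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if uri not in seen:
                seen.add(uri)
                seeds.append(uri)
    return seeds


# Hypothetical comment bodies, reusing URIs that appear elsewhere in these slides.
comments = [
    "WHO overview: http://www.who.int/csr/disease/ebola/en/.",
    "Same link http://www.who.int/csr/disease/ebola/en/ and (https://www.cdc.gov/vhf/ebola/)",
]
seeds = extract_uris(comments)
```

De-duplication here is exact string matching only; treating `http`/`https` variants or tracking parameters as the same seed would need a separate normalization step.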
Generating seeds from Wikipedia
● Wikipedia is a service that enables multiple contributors to create
documents about various topics ranging from politics to science and technology
● The references of Wikipedia documents include URIs relevant to the
document topic
○ Seeds = URIs in Wikipedia references
Example: Wikipedia references for Stoneman Douglas High School Shooting
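A minimal sketch of "Seeds = URIs in Wikipedia references", assuming the article's HTML is already in hand. It keeps only absolute http(s) links from anchor tags, which skips in-page and site-relative links. The class name, function name, and sample markup are hypothetical, and real Wikipedia reference markup may need extra filtering.

```python
from html.parser import HTMLParser


class ExternalLinkParser(HTMLParser):
    """Collects absolute http(s) hrefs from <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Relative links ("/wiki/...") and fragments ("#cite_note-1") are skipped.
        if href.startswith(("http://", "https://")) and href not in self.uris:
            self.uris.append(href)


def seeds_from_html(html):
    """Return candidate seed URIs found in an HTML fragment."""
    parser = ExternalLinkParser()
    parser.feed(html)
    return parser.uris


# Made-up fragment imitating a reference list; not copied from Wikipedia.
sample = """
<ol class="references">
  <li><a href="#cite_ref-1">^</a>
      <a href="http://www.nature.com/articles/nature10348">Nature article</a></li>
  <li><a href="/wiki/Ebola">Ebola</a>
      <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192">PMC</a></li>
</ol>
"""
wiki_seeds = seeds_from_html(sample)
```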
Not all micro-collections yield high-quality seeds:
How do we recognize low-quality seeds at scale?
● Spam links in tweets
● Hijacked hashtag
● Can we assess authority of source? (example shown: Infowars)
Example of potential
seeds generated from the
Google SERP for query:
“hurricane harvey”
Seeds generated from SERPs can
be used as a baseline to compare
social media micro-collections
The probability of finding the URI of a news story diminishes
with time
● Probability of finding the URI of the same story: daily 0.34 - 0.44,
weekly 0.01 - 0.11, and monthly 0.01 - 0.08
Characterizing or comparing collections is a challenging task
● It requires comparing collections that may cater to different needs
● We explored foundational work in collection characterization from
Library and Web Sciences
○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS)
● The CCS is used to describe individual collections and compare multiple
collections
CCS: NLM Ebola virus collection example
1. Distribution of topics: a ranked list of topics in a collection
a. “ebola outbreak west africa”
b. “guinea liberia sierra leone”
c. “cases ebola virus disease”
d. “public health workers”
e. “centers disease control prevention”
2. Distribution of sources (hostnames): a statistical summary of the
various sources sampled in order to build the collection:
a. 18 (12.5%) web pages from blogs.plos.org,
b. 14 (9.7%) from cdc.gov, and
c. 11 (7.6%) from twitter.com
(Top 10 hosts fraction of collection: 50%)
3. Temporal distribution - Publication & Content: the set of dates in a collection
(publication dates and dates mentioned in the content), e.g.,
“From August 2014–December 2015, the guidance was accessed online...The guidance
was retired on February 19, 2016, when more than 45 days had passed since Guinea
was declared free of Ebola virus transmission, because widespread human-to-human
transmission was at an end” Page last updated: December 27, 2017
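A minimal sketch of the "distribution of sources (hostnames)" measure: count how many seeds come from each hostname and report each host's share of the collection. The function name `source_distribution` and the sample URIs are assumptions for illustration; the slides do not say how the original summary was computed.

```python
from collections import Counter
from urllib.parse import urlsplit


def source_distribution(uris, top=10):
    """Return (hostname, count, percent of collection) for the most frequent hosts."""
    hosts = Counter(urlsplit(uri).hostname for uri in uris)
    total = len(uris)
    return [
        (host, count, round(100 * count / total, 1))
        for host, count in hosts.most_common(top)
    ]


# Small sample reusing seed URIs listed earlier in the slides.
sample_seeds = [
    "http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/",
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.who.int/csr/disease/ebola/en/",
]
distribution = source_distribution(sample_seeds)
```

Note that `urlsplit(...).hostname` keeps the `www.` prefix, so `www.cdc.gov` and `cdc.gov` count as different hosts unless they are normalized first.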
4. Content diversity: a value between 0 and 1 indicating the degree of
self-similarity of the text content of the collection
○ 0 - no diversity; duplicate documents
○ 1 - maximum diversity; documents without any common
vocabulary
Quantifying textual diversity in a collection
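One way to obtain a score with the endpoints described above (0 for duplicate documents, 1 for documents with no common vocabulary) is one minus the mean pairwise cosine similarity of bag-of-words vectors. This is a sketch under that assumption, not necessarily the formula used in the CCS; the function names and tokenization are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def content_diversity(documents):
    """1 - mean pairwise cosine similarity; 0.0 when there are fewer than two documents."""
    vectors = [term_vector(d) for d in documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


duplicates = ["Harvey Puts Houston Underwater"] * 3
unrelated = [
    "Donald Trump Congratulates Roy Moore for Primary Win",
    "Harvey Puts Houston Underwater",
    "Mass Shooting in Las Vegas",
]
low = content_diversity(duplicates)   # identical titles
high = content_diversity(unrelated)   # titles sharing no words
```

Scores for partially overlapping documents fall between the two extremes, but they depend on tokenization and weighting and should not be expected to match the values reported in the slides.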
Content diversity example (colors = collections, numbers = stories)
ID  News title                                                Collection
1   “Donald Trump Congratulates Roy Moore for Primary Win”    Roy Moore Wins
2   “Trump offers congratulations to Roy Moore”               Roy Moore Wins
3   “Roy Moore wins Alabama Senate GOP primary runoff”        Roy Moore Wins
4   “Harvey Puts Houston Underwater”                          Hurricane Harvey
5   “Hurricane Harvey intensifies to Category 2 storm”        Hurricane Harvey
6   “Harvey Puts Houston Underwater”                          Hurricane Harvey
7   “Mass Shooting in Las Vegas”                              Vegas Shooting
8   “Mass Shooting Outside Las Vegas’ Mandalay Bay”           Vegas Shooting
9   “Las Vegas shooting: What we know”                        Vegas Shooting
[Figure: diversity scores for sample sets of these stories (1 1 1; 2 2 2; 3 3 3; 1 2 3; 1 4 7; 1 8 9; 1 2 3 4 5 6 7 8 9). A set of identical stories (1 1 1) scores 0.00; the other scores shown are 0.30, 0.39, 0.58, 0.75, and 1.00.]
5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a
collection samples a single source, a handful of sources, or many sources.
There are multiple ways of measuring URL diversity
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
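A minimal sketch of one possible diversity ratio, assumed here for illustration: (unique items - 1) / (total items - 1), which is 0 when every item is the same and 1 when all items are distinct. The slides note there are multiple ways to measure URL diversity, so this should not be read as the definition used in the CCS. Domain-level diversity would additionally need a public suffix list and is left out.

```python
from urllib.parse import urlsplit


def diversity(items):
    """(unique - 1) / (total - 1); defined as 0.0 for fewer than two items."""
    items = list(items)
    if len(items) < 2:
        return 0.0
    return (len(set(items)) - 1) / (len(items) - 1)


def source_diversity(uris):
    """Diversity at the URI level and at the hostname level."""
    hosts = [urlsplit(uri).hostname for uri in uris]
    return {"uri": diversity(uris), "hostname": diversity(hosts)}


# Three distinct pages from a single host, taken from the seed list in the slides.
single_host = [
    "http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html",
    "http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html",
    "http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html",
]
scores = source_diversity(single_host)
```

In this sample every URI is distinct but all come from one host, which is the "samples a single source" case described above.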
CCS: Approximating popularity and target audience
6. Collection exposure - Archival rate and Tweet index rate: approximates popularity
● Archival rate: fraction of archived URIs in collection
● Tweet index rate: fraction of URIs in collection found embedded in tweets
7. Target audience: approximates target audience of a collection with readability scores
Examples (grade level - title - source):
7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
12th - “Harvey Puts Houston Underwater” - dailycaller.com
18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular
receptor” - nih.gov
20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
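Both exposure rates are fractions of the collection, so they can be sketched with one helper. The sketch assumes the sets of archived or tweeted URIs are already known; how to look them up (for example, by querying a web archive or a tweet search) is outside this example. The function name and sample data are hypothetical.

```python
def index_rate(collection_uris, indexed_uris):
    """Fraction of distinct collection URIs that also appear in indexed_uris."""
    uris = set(collection_uris)
    if not uris:
        return 0.0
    return len(uris & set(indexed_uris)) / len(uris)


collection = [
    "http://www.who.int/csr/disease/ebola/en/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://allafrica.com/stories/201407310957.html",
    "http://www.nature.com/articles/nature10348",
]
# Hypothetical lookup results: which URIs were found archived, and which were found in tweets.
archived = collection[:3]
tweeted = collection[:1]

archival_rate = index_rate(collection, archived)
tweet_index_rate = index_rate(collection, tweeted)
```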
Comparing collections with CCS
● We represented each collection as an n-dimensional vector of CCS values.
● We calculated the distance between vectors.
CCS metrics                               Archive-It Col.   Reddit Col.
Doc-Term content diversity                0.86              0.89
List of entity set content diversity      0.65              0.85
URI diversity                             1.00              0.98
Domain diversity                          0.34              0.50
Hostname diversity                        0.43              0.53
Social media rate                         0.07              0.12
Archival rate                             0.99              0.78
Tweet index rate                          0.72              0.40
Exposure rate (reading level)             0.61              0.61
n-gram similarity of topic distribution   1.00              0.70
Normalized Euclidean distance             0.17
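A minimal sketch of the vector comparison, using the ten metric values from the table. The normalization is an assumption: the Euclidean distance is divided by the square root of the number of dimensions so that, for values in [0, 1], the result also stays in [0, 1]. The slides do not state which normalization was used.

```python
import math


def normalized_euclidean(u, v):
    """Euclidean distance divided by sqrt(n), for two equal-length vectors."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared / len(u))


# CCS values in table order (Doc-Term content diversity ... n-gram topic similarity).
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]

distance = normalized_euclidean(archive_it, reddit)
```

On the rounded table values this gives about 0.18, close to but not exactly the reported 0.17; the gap may come from rounding in the table or from a different normalization in the original work.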
Summary of completed research (2016)
● JCDL 2016: Need for using local news sources to build collections for local
events.
Summary of completed research (2017-2018)
● HyperText 2018: Introduced Collection Characterizing Suite for characterizing and
comparing collections
● JCDL 2018: Investigated discoverability of URIs of news stories on SERPs
Outline of work that informed this research
● Studying SERPs: A Supervised Learning Algorithm for Binary Domain
Classification of Web Queries using SERPs (JCDL 2016 Poster)
● Interacting with Twitter
a. Extracting tweet conversations
b. Finding URLs on Twitter
● Extracting text from news documents
● Finding Storify stories
Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
● Identify hubs & authorities in social media: 2018-06 - 2018-12
● Candidacy proposal: 2018-06 - 2018-12
● Implement seed generation system: 2019-01 - 2019-03
● Evaluate seed generation system: 2019-04 - 2019-08
● Dissertation/Defense: 2019-09 - 2019-12
Conclusions
● Archived collections offer a way of preserving the historic record of important events
and begin with seeds.
● We propose exploiting micro-collections on social media to augment or bootstrap
archived collections for stories and events.
○ Introduced the CCS for characterizing and comparing collections
○ We showed that micro-collections generated from social media are similar to
Archive-It seeds
● Primary research tasks remaining:
○ Identify hubs and authorities in social media as a method to evaluate quality at scale
○ Investigate what makes “good” seeds and implement/evaluate seed generation
system
@acnwala @webscidl
Thank you!

Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media

  • 1.
    Bootstrapping Web ArchiveCollections of Stories from Micro-collections in Social Media 1 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 2.
    Alexander C. Nwala Supervisor:Michael L. Nelson and Co-supervisor: Michele C. Weigle Old Dominion University Web Science & Digital Libraries Research Group @acnwala • @WebSciDL Joint Conference on Digital Libraries (JCDL) Doctoral Consortium June 3, 2018, Fort Worth, TX This work was made possible in part by IMLS LG-71-15-0077-15 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media 2 Thank you SIGIR for the Travel Grant @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 3.
    Outline 1. Introduction andMotivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions 3 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 4.
    In March 2014,there was a serious outbreak of Ebola in West Africa 1 https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/ 4 The outbreak severely affected Guinea, Liberia, and Sierra Leone with about 11,000 deaths1. http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 5.
    5 http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/ A few monthsafter the Ebola outbreak, an Archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak. @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 6.
    6 Archive-It Ebola virusseeds http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/ http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html http://www.cdc.gov/mmwr/ebola_reports.html http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/ http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/ http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/ http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/ http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html 
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1 http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/ http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105 http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105 http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/ http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/ http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/ http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/ Sample of Archive-It Ebola virus seeds
  • 7.
    ● A seedlist is an initial collection exemplar web pages for a topic ○ seeds + linked pages form a collection when crawled ● Archived web collections consist of groups of web pages that share a common topic e.g., “Ebola virus” and “2018 Winter Olympics.” ● Human-generated seeds are high-quality, but expensive to generate 7 Archived web collections begin with seeds @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 8.
    Archived web collectionsoffer a way of preserving the historic record of important events 8 http://xhosaculture.co.za/ Mandela’s legacy https://www.wsj.com/ 2016 Dakota Access Pipeline http://www.nj.com/ 2018 Winter Olympics http://xhosaculture.co.za/ Mandela’s legacy @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 9.
    ● The InternetArchive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for: 9 Seeds may be generated by multiple users
  • 10.
    10 Sample seeds contributed for theBoston Marathon Bombing Collection
  • 11.
    12 Users on socialmedia share stories that include hand-selected URIs ● The Wikipedia page about the Stoneman Douglas High School shooting ● created the same day as the shooting event (February 14, 2018) ● We consider Wikipedia references an example of a Micro-collection We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds
  • 12.
    13 More micro-collections: extractURIs from Twitter Moments to generate seeds: ● Stoneman Douglas High School shooting Twitter Moment created the day after event
  • 13.
    14 Storify story publishedJan 2014: “Protests In Kiev Turn Violent,” before the major event: Russian annexation of Crimea (started late February 2014) Micro-collections often start early before major events
  • 14.
    15 Archive-It collection forthe event potentially omits some of the prelusive contents in the Storify micro-collection Micro-collections may include prelusive events lacking in collections triggered by major events Storify story of the Ukrainian crisis event (January 2014) highlights riots before Russian annexation of Crimea (late February).
  • 15.
    We propose extractingURIs from social media micro-collections to bootstrap archived collections or augment curator-selected seeds. 16 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 16.
    http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground- zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html https://twiter.com/ebola_response/ https://www.who.int/mediacentre/news/ebola/en/ http://allafrica.com/stories/201407310957.html http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html http://jid.oxfordjournals.org/content/204/suppl_3/S785_long http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804 https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106 http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest http://www.who.int/csr/disease/ebola/en/ http://www.nature.com/articles/nature10348 17 Sample ofseed URIs for Ebola virus topic Reddit SERP and comments Archive-It seeds Wikipedia references Micro-collections
  • 17.
    Taking the effortto create micro-collections is an indication of editorial effort, and thus presumably quality of the seeds. 18 Wikipedia references for Stoneman Douglas High School Shooting
  • 18.
    Outline 1. Introduction andMotivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions 19 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 19.
    ● Research question1: ○ Are seeds that are generated automatically from micro-collections in social media comparable to curator-generated seeds? ○ What quantitative method(s) can be used to compare collections? ● Research question 2: ○ If we consider curator hand-selected seeds the gold standard for collections, could this lead to the definition of what makes a collection good? ○ How do we assess the quality of collections at scale? 20 Primary research questions @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 20.
    Outline 1. Introduction andMotivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions 21 @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 21.
    ● We implementeda prototype system for generating seeds from the following social media sites: ○ Storify (out of service since May 16, 2018), ○ Twitter Moments, ○ Reddit, and ○ Wikipedia ● We also generated seeds from the Google SERP as a baseline to compare social media micro-collections since we believe SERPs are a primary source of discovering seeds. 22 Generating seed URIs from social media
  • 22.
    23 Social media micro-collectionwere similar to Archive-It seeds ≈ Euclidean distance range between collections: 0.17 to
  • 23.
    ● Storify wasa social media curation service that enables users to create stories that consist of hand-selected web resources such as: ○ URIs of news articles, images, videos, etc. ○ Seeds = URIs in Storify stories ○ Unfortunately, Storify went out of service in May 2018 ○ http://ws- dl.blogspot.com/2017/08/2017-08- 11-where-can-we-post-stories.html 24 Generating seeds from Storify
  • 24.
    ● Twitter Momentsis a service by Twitter that lets users create topical collections of tweets. ● Tweets in Twitter Moments embed ○ URIs of news articles, images, videos, etc. ○ Seeds = URIs in Twitter Moments 25 Generating seeds from Twitter Moments
  • 25.
    ● Reddit isa service that allows users to post URIs for various topics. ● Reddit users rate the URIs and post comments that may also include URIs ○ Seeds = URIs in Reddit pages and comments 26 Generating seeds from Reddit @acnwala • JCDL 2018 • June 3, 2018 Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
  • 26.
    Generating seeds from Wikipedia
    ● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology
    ● The references of Wikipedia documents include URIs relevant to the document topic
      ○ Seeds = URIs in Wikipedia references
    Example: Wikipedia references for the Stoneman Douglas High School shooting
  • 27.
    Not all micro-collections yield high quality seeds: How do we recognize low quality seeds at scale?
    ● Spam links in tweets
    ● Hijacked hashtag
    ● Can we assess authority of source? (Infowars)
  • 28.
    Example of potential seeds generated from the Google SERP for query: “hurricane harvey”
    Seeds generated from SERPs can be used as a baseline for comparison with social media micro-collections
  • 29.
    The probability of finding the URI of a news story diminishes with time
    ● Probability of finding the URI of the same story: daily 0.34 - 0.44, weekly 0.01 - 0.11, and monthly 0.01 - 0.08
  • 30.
    Outline
    1. Introduction and Motivation
    2. Research questions
    3. Preliminary work
       a. Bootstrapping Web Archive Collections from Social Media
       b. Collection Characterization
       c. Summary of completed work
    4. Proposed work
    5. Conclusions
  • 31.
    Characterizing or comparing collections is a challenging task
    ● It requires comparing collections that may cater to different needs
    ● We explored foundational work in collection characterization from Library and Web Sciences
      ○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS)
    ● The CCS is used to describe individual collections and compare multiple collections
  • 32.
    CCS: NLM Ebola virus collection example
    1. Distribution of topics: a ranked list of topics in a collection
       a. “ebola outbreak west africa”
       b. “guinea liberia sierra leone”
       c. “cases ebola virus disease”
       d. “public health workers”
       e. “centers disease control prevention”
    2. Distribution of sources (hostnames): a statistical summary of the various sources sampled in order to build the collection:
       a. 18 (12.5%) web pages from blogs.plos.org,
       b. 14 (9.7%) from cdc.gov, and
       c. 11 (7.6%) from twitter.com
       (Top 10 hosts fraction of collection: 50%)
    3. Temporal distribution - Publication & Content: the dates found in a collection, for example:
       “From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end”
       Page last updated: December 27, 2017
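The distribution of sources described above (counts and percentages per hostname) could be computed along the following lines. This is a minimal sketch rather than the CCS implementation; `hostname_distribution` is a hypothetical helper, and the sample URIs use hostnames from the slide with made-up paths.

```python
from collections import Counter
from urllib.parse import urlparse


def hostname_distribution(uris):
    """Return (hostname, count, fraction) tuples sorted by descending count."""
    hosts = [urlparse(uri).hostname for uri in uris]
    hosts = [host for host in hosts if host]  # skip URIs without a hostname
    counts = Counter(hosts)
    total = len(hosts)
    return [(host, count, count / total) for host, count in counts.most_common()]


# Made-up sample collection (paths are illustrative only).
sample_uris = [
    "http://blogs.plos.org/post-a",
    "http://blogs.plos.org/post-b",
    "https://www.cdc.gov/vhf/ebola/",
    "https://twitter.com/example/status/1",
]
distribution = hostname_distribution(sample_uris)
```

Summing the fractions of the first ten tuples would give a figure comparable to the "Top 10 hosts fraction of collection" reported on the slide.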
  • 33.
    Quantifying textual diversity in a collection
    4. Content diversity: a value between 0 and 1 indicating the degree of self-similarity of the text content of the collection
       ○ 0 - no diversity; duplicate documents
       ○ 1 - maximum diversity; documents without any common vocabulary
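The slide gives the endpoints of the content diversity scale but not the formula. One way to obtain a score with those endpoints, assumed here purely for illustration, is one minus the mean pairwise cosine similarity of bag-of-words vectors. The function names and tokenization below are hypothetical and may differ from the measure used in the CCS.

```python
import math
import re
from collections import Counter
from itertools import combinations


def _bag_of_words(text):
    """Lowercased word counts; a deliberately simple tokenizer (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def content_diversity(documents):
    """Assumed measure: 1 - mean pairwise cosine similarity, clamped to [0, 1]."""
    vectors = [_bag_of_words(doc) for doc in documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two documents: nothing to compare
    mean_similarity = sum(_cosine(a, b) for a, b in pairs) / len(pairs)
    return min(1.0, max(0.0, 1.0 - mean_similarity))


duplicates = content_diversity(["Harvey Puts Houston Underwater"] * 3)
disjoint = content_diversity(["Mass Shooting in Las Vegas", "Harvey Puts Houston Underwater"])
```

Under this assumed definition, identical documents score 0 and documents with no shared vocabulary score 1, matching the two endpoints listed on the slide.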
  • 34.
    Content diversity example (colors = collections, numbers = stories)
    ID  News Titles                                             Collection
    1   “Donald Trump Congratulates Roy Moore for Primary Win”  Roy Moore Wins
    2   “Trump offers congratulations to Roy Moore”             Roy Moore Wins
    3   “Roy Moore wins Alabama Senate GOP primary runoff”      Roy Moore Wins
    4   “Harvey Puts Houston Underwater”                        Hurricane Harvey
    5   “Hurricane Harvey intensifies to Category 2 storm”      Hurricane Harvey
    6   “Harvey Puts Houston Underwater”                        Hurricane Harvey
    7   “Mass Shooting in Las Vegas”                            Vegas Shooting
    8   “Mass Shooting Outside Las Vegas’ Mandalay Bay”         Vegas Shooting
    9   “Las Vegas shooting: What we know”                      Vegas Shooting
    [Figure: diversity scores for example groupings of stories; values shown: 0.39, 0.58, 0.30, 0.00, 0.00, 0.00, 1.00, 1.00, 0.75]
  • 35.
    There are multiple ways of measuring URL diversity
    5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a collection samples a single source, a handful of sources, or many sources.
    http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
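As the slide notes, there are multiple ways of measuring URL diversity, and the slides do not commit to one here. The sketch below assumes one simple choice, (unique - 1) / (total - 1), so that a single repeated source scores 0 and all-distinct sources score 1. The helper names are hypothetical, and domain-level diversity is omitted because it would need a public-suffix list.

```python
from urllib.parse import urlparse


def diversity(items):
    """Assumed index: (unique - 1) / (total - 1); 0 = one source, 1 = all distinct."""
    total = len(items)
    if total < 2:
        return 0.0
    return (len(set(items)) - 1) / (total - 1)


def source_diversity(uris):
    """Diversity at the URI level and at the hostname level."""
    hostnames = [urlparse(uri).hostname or "" for uri in uris]
    return {"uri": diversity(uris), "hostname": diversity(hostnames)}


# Made-up collection: three distinct URIs drawn from two hosts.
scores = source_diversity([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
```

Computing the same index at several granularities shows how a collection can look diverse by URI while still drawing on only a handful of hosts.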
  • 36.
    CCS: Approximating popularity and target audience
    6. Collection exposure - Archival rate and Tweet index rate: approximates popularity
       ● Archival rate: fraction of archived URIs in collection
       ● Tweet index rate: fraction of URIs in collection found embedded in tweets
    7. Target audience: approximates target audience of a collection with readability scores (grade level - title - source)
       ● 7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
       ● 11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
       ● 12th - “Harvey Puts Houston Underwater” - dailycaller.com
       ● 18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov
       ● 20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
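Both exposure measures are fractions over the URIs in a collection, so they can share one helper. The sketch below assumes that the sets of archived and tweeted URIs have already been determined by some lookup step, which is outside its scope; `rate` and the sample data are hypothetical.

```python
def rate(uris, found):
    """Fraction of `uris` that appear in the set `found` (0.0 for an empty collection)."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(1 for uri in uris if uri in found) / len(uris)


# Made-up collection and lookup results (assumptions for illustration).
collection = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
archived = {"http://a.example/1", "http://a.example/2", "http://b.example/1"}
tweeted = {"http://c.example/1"}

archival_rate = rate(collection, archived)
tweet_index_rate = rate(collection, tweeted)
```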
  • 37.
    Comparing collections with CCS
    ● We represented each collection as an n-dimensional vector of CCS values.
    ● We calculated the distance between vectors.
    CCS metrics                               Archive-It Col.   Reddit Col.
    Doc-Term content diversity                0.86              0.89
    List of entity set content diversity      0.65              0.85
    URI diversity                             1.00              0.98
    Domain diversity                          0.34              0.50
    Hostname diversity                        0.43              0.53
    Social media rate                         0.07              0.12
    Archival rate                             0.99              0.78
    Tweet index rate                          0.72              0.40
    Exposure rate (reading level)             0.61              0.61
    n-gram similarity of topic distribution   1.00              0.70
    Normalized Euclidean distance: 0.17
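The vector comparison can be sketched with the ten CCS values listed on this slide. The slide does not state how the Euclidean distance was normalized; dividing the squared distance by the number of dimensions before taking the square root is an assumption made here, and `normalized_euclidean` is a hypothetical helper.

```python
import math

# CCS values from the slide, in the order the metrics are listed.
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]


def normalized_euclidean(u, v):
    """Euclidean distance scaled by sqrt(n); stays in [0, 1] when every value is in [0, 1]."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared / len(u))


distance = normalized_euclidean(archive_it, reddit)
```

Under this assumed normalization the distance comes out at roughly 0.18, close to but not identical to the 0.17 reported on the slide, so the normalization or rounding used in the original work probably differs slightly.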
  • 38.
    Outline
    1. Introduction and Motivation
    2. Research questions
    3. Preliminary work
       a. Bootstrapping Web Archive Collections from Social Media
       b. Collection Characterization
       c. Summary of completed work
    4. Proposed work
    5. Conclusions
  • 39.
    Summary of completed research (2016)
    JCDL 2016: need for using local news sources to build collections for local events.
  • 40.
    Summary of completed research (2017-2018)
    HyperText 2018: Introduced the Collection Characterizing Suite for characterizing and comparing collections
  • 41.
    Summary of completed research (2017-2018)
    JCDL 2018: Investigated discoverability of URIs of news stories on SERPs
  • 42.
    Outline of work that informed this research
    ● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster)
    ● Interacting with Twitter
      a. Extracting tweet conversations
      b. Finding URLs on Twitter
    ● Extracting text from news documents
    ● Finding Storify stories
  • 43.
    Outline
    1. Introduction and Motivation
    2. Research questions
    3. Preliminary work
       a. Bootstrapping Web Archive Collections from Social Media
       b. Collection Characterization
       c. Summary of completed work
    4. Proposed work
    5. Conclusions
  • 44.
    Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
    ● 2018-06 - 2018-12: Identify hubs & authorities in social media
    ● 2018-06 - 2018-12: Candidacy proposal
    ● 2019-01 - 2019-03: Implement seed generation system
    ● 2019-04 - 2019-08: Evaluate seed generation system
    ● 2019-09 - 2019-12: Dissertation/Defense
  • 45.
    Conclusions
    ● Archived collections offer a way of preserving the historic record of important events, and they begin with seeds.
    ● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events.
      ○ Introduced the CCS for characterizing and comparing collections
      ○ We showed that micro-collections generated from social media are similar to Archive-It seeds
    ● Primary research tasks remaining:
      ○ Identify hubs and authorities in social media as a method to evaluate quality at scale
      ○ Investigate what makes “good” seeds and implement/evaluate seed generation system
    Thank you! @acnwala @webscidl

Editor's Notes

  • #29 https://twitter.com/tahDeetz/status/494886192299536385 https://twitter.com/SpaceCoastMetal/status/495589749051363328