Scraping SERPs For Archival Seeds - It Matters When You Start

Scraping SERPs for Archival Seeds:
It Matters When You Start

Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL)
June 5, 2018, Fort Worth, TX
This work was made possible in part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
@acnwala
Scraping SERPs for Archival Seeds: It Matters When You Start
JCDL 2018 • June 5, 2018

Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions

In March 2014, there was a serious outbreak of Ebola in West Africa
The outbreak severely affected Guinea, Liberia, and Sierra Leone with about 11,000 deaths [1].
[1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog

http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
A few months after the Ebola outbreak, an archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak.

Archive-It Ebola virus seeds
http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
http://www.cdc.gov/mmwr/ebola_reports.html
http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
Sample of Archive-It Ebola virus seeds

Archived web collections begin with seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ Seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics”
● Human-generated seeds are high-quality, but expensive to generate

Archived web collections offer a way of preserving the historic record of important events
http://xhosaculture.co.za/
Mandela’s legacy
https://www.wsj.com/
2016 Dakota Access Pipeline
http://www.nj.com/
2018 Winter Olympics

Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for specific collections

Sample seeds contributed for the Boston Marathon Bombing Collection

Tweet requests for other collections
1. SOPA blackout (Jan 2012)
2. Hurricane Sandy (Aug 2012)
3. 2012 Occupy movement (May 2012)
4. Aaron Swartz (Jan 2013)
5. Supreme Court hearings DOMA (Mar 2013)
6. Boston Marathon Bombing (Apr 2013)
7. Nelson Mandela (Dec 2013)
8. 2014 Ferguson, MO (Aug 2014)
9. Ebola virus (Oct 2014)
10. 2016 U.S. presidential election (Nov 2016)
11. #DAPL (Dec 2016)
12. 2018 Winter Olympics (Feb 2018)

Collection building often begins with a simple Google search
● Seeds can be discovered by issuing queries (e.g., “hurricane harvey”) to Google and extracting URIs from the SERP (Search Engine Result Page)
● URIs for older news stories may be difficult to discover via Google after one month (research result)

Before extracting seed URIs of news stories from SERPs, we investigated re-finding URIs on Google.

All (renamed General) SERP
● Wikipedia page present
● Older documents
News vertical SERP
● Wikipedia page absent
● Newer documents (e.g., 2 hours old)

Depending on query/topic, new pages displace older pages in SERPs
query = "healthcare bill"
SERP on May 25, 2017: initial stages of the AHCA bill and the struggles to pass the bill
● Healthcare saga shaping GOP approach to tax bill (thehill.com)
● US Senate’s McConnell sees tough path for passing healthcare bill (cnbc.com)
● Will the Republican Health Care Bill Really Lower Premiums? (time.com)
SERP on January 5, 2018 (7 months later): later stages of the AHCA bill and the failure of the bill, which happened in September 2017
● House Republicans used lessons from failed health care bill to pass tax reform, Ryan says (pbs.org)
● GOP tax bill also manages to needlessly screw up the healthcare system (latimes.com)
● How GOP tax bill’s Obamacare changes will affect health care and consumers (chicagotribune.com)

Primary research questions
● RQ1: What is the rate at which new URIs replace old URIs on the SERP over time?
● RQ2: What is the probability of finding the same URI with the same query over time?

Methodology: Dataset generation, representation, and processing
● For a seven-month period (May 25, 2017 to January 12, 2018), we issued seven queries every day and extracted URIs from the first five pages of the General and News vertical SERPs:
1. “healthcare bill”
2. “manchester bombing”
3. “london terrorism”
4. “trump russia”
5. “travel ban”
6. “hurricane harvey”
7. “hurricane irma”

● Dataset generated by extracting URIs from SERPs for seven queries
● Dataset was semi-automatically generated with the http://www.localmemory.org/ collection generator Chrome extension
● 151,602 URIs (33,432 unique)

Tracking URIs: single query perspective
1: Query issued
2: URIs extracted (scheme and query parameters removed to track URIs)
3: URI info stored in JSON files
4: Date URI was found, page, etc.
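The tracking steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the JSON field names, and the decisions to lowercase the host and drop the fragment are all assumptions made for the example.

```python
import json
from urllib.parse import urlsplit


def normalize_uri(uri: str) -> str:
    """Drop the scheme and query parameters so variants of a URI compare equal."""
    parts = urlsplit(uri)
    # Assumption: the host is lowercased and the fragment is discarded too.
    return parts.netloc.lower() + parts.path


def make_record(uri: str, query: str, serp: str, page: int, date: str) -> dict:
    """Build one observation of a URI; all field names here are hypothetical."""
    return {
        "uri": uri,
        "key": normalize_uri(uri),  # used to track the URI across days
        "query": query,
        "serp": serp,               # e.g. "General" or "News Vertical"
        "page": page,               # SERP page (1-5) on which the URI was found
        "date": date,               # date the URI was found
    }


record = make_record(
    "http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105",
    query="ebola virus",
    serp="General",
    page=1,
    date="2017-05-25",
)
print(json.dumps(record, indent=2))
```

With this normalization, the http and https forms of a link, with or without tracking parameters, map to the same key.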

Retrievability of URIs was assessed by extracting four measures
● URI replacement rate, new URI rate, and page-level new URI rate
● Probability of finding a story
● Distribution of stories over time across pages
● Overlap rate and recall (see paper for details)

Example: URI replacement rate and new URI rate
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,x,y}
○ URI replacement rate at t1 (at t1, c was replaced):
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,c,d,e}
○ New URI rate from t0 to t1 (at t1, we saw new URIs d and e):
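The two examples can be worked through with set arithmetic. The definitions below are assumptions, since this transcript does not carry the slide's formulas: the replacement rate is taken as the fraction of URIs seen at t0 that are missing at t1, and the new URI rate as the fraction of URIs seen at t1 that were absent at t0. The paper has the authoritative definitions.

```python
def uri_replacement_rate(old: set, new: set) -> float:
    """Assumed definition: share of URIs seen at t0 that are gone at t1."""
    return len(old - new) / len(old) if old else 0.0


def new_uri_rate(old: set, new: set) -> float:
    """Assumed definition: share of URIs seen at t1 that were not seen at t0."""
    return len(new - old) / len(new) if new else 0.0


t0 = {"a", "b", "c"}

# First example: c was replaced at t1.
replacement = uri_replacement_rate(t0, {"a", "b", "x", "y"})

# Second example: d and e are new at t1.
new_rate = new_uri_rate(t0, {"a", "b", "c", "d", "e"})

print(replacement, new_rate)
```

Under these assumed definitions the first example gives 1/3 and the second gives 2/5.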

Probability of finding a URI over time and distribution of URIs over time across pages
Example with 3 URIs (cell = page on which the URI was found; 0 = not found on pages 1-5):
               Day 1  Day 2  Day 3  Day 4  Day 5
URI-1          4      2      0      0      0
URI-2          1      2      1      0      0
URI-3          1      1      1      1      0
Probabilities  3/3    3/3    2/3    1/3    0/3
● URI-1 found on page 4 on Day 1
● URIs 1-3 NOT found (pages 1-5) on Day 5
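The table can be reproduced with a few lines of code. This is a sketch that assumes a cell value of 0 means the URI was not found on pages 1-5 that day; the variable and function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Page on which each URI was found per day; 0 = not found on pages 1-5.
pages_by_day = {
    "URI-1": [4, 2, 0, 0, 0],
    "URI-2": [1, 2, 1, 0, 0],
    "URI-3": [1, 1, 1, 1, 0],
}


def daily_find_probability(pages: dict) -> list:
    """For each day, the fraction of tracked URIs found on any of pages 1-5."""
    total = len(pages)
    return [
        Fraction(sum(1 for page in day if page > 0), total)
        for day in zip(*pages.values())
    ]


def daily_page_distribution(pages: dict) -> list:
    """For each day, how many of the found URIs appeared on each page."""
    return [
        Counter(page for page in day if page > 0)
        for day in zip(*pages.values())
    ]


probabilities = daily_find_probability(pages_by_day)
distribution = daily_page_distribution(pages_by_day)
print([str(p) for p in probabilities])
print(distribution[0])
```

The probabilities match the bottom row of the table (3/3 and 0/3 print in reduced form as 1 and 0), and the Day 1 distribution shows two URIs on page 1 and one on page 4.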

General SERP collections had lower new URI rates, and thus produced URIs with a longer lifespan, than News vertical SERP collections
Figure: Hurricane Harvey, General SERP vs. News Vertical SERP

Probability of finding the same URI with the same query on the News vertical SERP after 1 month ≈ 0
Figure: Hurricane Harvey
● URIs of some news stories may not be easily discoverable if the query is issued after 1 month:
○ It matters when users search for URIs

The “life span” of URIs depends not just on SERPs, but also on topics
The URIs in “hurricane harvey” had a “longer life” than those in “trump russia” due to a lower rate of new URIs
Figure panels: hurricane harvey and trump russia, each showing General URIs and News Vertical URIs

Results: the URI replacement and new URI rates are strongly dependent on the topic
E.g., Hurricane Harvey’s daily avg. URI replacement rate (0.21) and avg. new URI rate (0.21) were lower than those of Trump Russia, which had the highest daily to monthly avg. URI replacement and avg. new URI rates
Table: Average URI replacement rate (column markers: min-, max+)
Table: Average new URI rate (column markers: min-, max+)

The probability of finding the same URI of a news story with the same query decreased with time for both SERPs
● The probability of finding the URI for a news story when the same query is issued
one day after it was first observed:
○ 0.34 - 0.44 (General) vs 0.28 - 0.40 (News Vertical)
● After one week:
○ Weekly: 0.01 - 0.11 (General) vs 0.03 - 0.14 (News Vertical)
● After one month:
○ Monthly: 0.01 - 0.08 (General) vs 0 (News Vertical)

Generalization of the probability of finding an arbitrary URI as a function of time (days)
● We fitted an exponential model to the union of occurrences of the URIs in our dataset.
● The probability of finding an arbitrary URI of a news story s on a SERP sp ∈ {General, News Vertical} after k days is predicted as follows:
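The general idea of fitting a decaying exponential to find-probabilities can be sketched as below. Everything specific here is an assumption: the single-term form p(k) = a·exp(−b·k), the log-space least-squares fit (a stand-in technique, not necessarily the authors' fitting procedure), and the data points, which are synthetic and not taken from the study. The actual model may have more terms, and its coefficients are in the paper.

```python
import math


def fit_exponential(days, probs):
    """Least-squares fit of log(p) = log(a) - b*k, i.e. p(k) = a * exp(-b * k)."""
    # Zero probabilities have no logarithm, so they are skipped.
    points = [(k, math.log(p)) for k, p in zip(days, probs) if p > 0]
    n = len(points)
    mean_k = sum(k for k, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in points)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in points)
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_k)
    return a, -slope


def predict(a: float, b: float, k: float) -> float:
    """Predicted probability of finding a URI k days after first observation."""
    return a * math.exp(-b * k)


# Synthetic, illustrative observations only (NOT results from the study).
days = [1, 2, 3, 5, 7, 14]
probs = [0.5 * math.exp(-0.3 * k) for k in days]

a, b = fit_exponential(days, probs)
print(round(a, 3), round(b, 3), round(predict(a, b, 30), 6))
```

Because the synthetic points were generated from a = 0.5 and b = 0.3, the fit recovers those values; with real observations the fit would only approximate the data.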

URIs show multiple page movement patterns
● Each box represents a URI; the number in a box is the page on which the URI was found, e.g.:
● https://en.wikipedia.org/wiki/Manchester_Arena_bombing (page 1)
● Color codes:
○ Page 1
○ Page 2
○ Page 3
○ Page 4
○ Page 5
○ White (outside pages 1-5)
Figure axes: time (May 25, 2017 to July 15, 2017) vs. URIs

Persistent vs rank climbing/falling page movement patterns
● Persistent: some URIs persist over time within the same page, e.g., https://en.wikipedia.org/wiki/Manchester_Arena_bombing
● Rapid/steady rank climbing and falling, e.g., https://www.rollingstone.com/music/news/manchester-bombing-what-we-know-about-arena-terror-attack-w483752:
○ Rapid climb: some URIs go from page 5 to 1 (skipping pages 4 - 2)
○ Rapid fall: others go from page 1 to 5
○ Steady climb or fall: others move one page at a time (e.g., 3 - 2 - 1)
Figure axes: time (May 25, 2017 to July 15, 2017) vs. URIs

Scraping SERPs for seeds?
Start early, don’t stop!

Conclusions
● Collection building offers a way of preserving the historic record of important events and begins with seeds.
● Search engines provide an opportunity to build collections or extract seeds, but tend to provide the most recent documents.
● Our findings about the difficulty in “refinding” news stories suggest that collection building efforts that utilize SERPs should start early and persist.
@acnwala @webscidl
Thank you!
Access our research dataset of 151,602 (33,432 unique) links extracted from
the Google SERPs for over seven months:
https://github.com/anwala/SERPRefind
