Scraping SERPs for Archival Seeds:
It Matters When You Start
Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL)
June 5, 2018, Fort Worth, TX
This work was made possible in
part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
@acnwala
Scraping SERPs for Archival Seeds: It Matters When You Start
JCDL 2018 • June 5, 2018
Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions
In March 2014, there was a serious outbreak of Ebola in West Africa.
The outbreak severely affected Guinea, Liberia, and Sierra Leone with about 11,000 deaths [1].
[1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
A few months after the Ebola outbreak, an archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak.
Archive-It Ebola virus seeds
http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
http://www.cdc.gov/mmwr/ebola_reports.html
http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
Sample of Archive-It Ebola virus seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a
common topic e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality, but expensive to generate
Archived web collections begin with seeds
Archived web collections offer a way of preserving the
historic record of important events
http://xhosaculture.co.za/
Mandela’s legacy
https://www.wsj.com/
2016 Dakota Access Pipeline
http://www.nj.com/
2018 Winter Olympics
Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for:
1. SOPA blackout (Jan 2012)
2. Hurricane Sandy (Aug 2012)
3. 2012 Occupy movement (May 2012)
4. Aaron Swartz (Jan 2013)
5. Supreme Court hearings DOMA (Mar 2013)
6. Boston Marathon Bombing (Apr 2013)
7. Nelson Mandela (Dec 2013)
8. 2014 Ferguson, MO (Aug 2014)
9. Ebola virus (Oct 2014)
10. 2016 U.S. presidential election (Nov 2016)
11. #DAPL (Dec 2016)
12. 2018 Winter Olympics (Feb 2018)
[Figure: Sample seeds contributed for the Boston Marathon Bombing Collection]
[Figure: Tweet requests for other collections]
● Seeds can be discovered by issuing queries (e.g., “hurricane harvey”) to Google and extracting URIs from the SERP (Search Engine Result Page)
● URIs for older news stories may be difficult to discover via Google after one month (research result)
Collection building often begins with a simple Google search
Before extracting seed URIs of news stories from
SERPs, we investigated re-finding URIs on Google.
[Figure: All (renamed General) SERP vs. News vertical SERP]
● All (renamed General) SERP: Wikipedia page present; older documents
● News vertical SERP: Wikipedia page absent; newer documents (e.g., 2 hours old)
Depending on query/topic, new pages displace older pages in SERPs:
query = "healthcare bill"
● SERP on May 25, 2017: initial stages of the AHCA bill and the struggles to pass the bill
● SERP on January 5, 2018 (7 months later): later stages of the AHCA bill and the failure of the bill, which happened in September 2017
Sample headlines from the two SERPs:
● Healthcare saga shaping GOP approach to tax bill (thehill.com)
● US Senate’s McConnell sees tough path for passing healthcare bill (cnbc.com)
● Will the Republican Health Care Bill Really Lower Premiums? (time.com)
● House Republicans used lessons from failed health care bill to pass tax reform, Ryan says (pbs.org)
● GOP tax bill also manages to needlessly screw up the healthcare system (latimes.com)
● How GOP tax bill’s Obamacare changes will affect health care and consumers (chicagotribune.com)
Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions
● RQ1: What is the rate at which new URIs replace old URIs on the SERP over time?
● RQ2: What is the probability of finding the same URI with the same query over time?
Primary research questions
Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions
● For a 7-month period (May 25, 2017 to January 12, 2018), we issued 7 queries every day and extracted URIs from the first 5 pages of each SERP (General and News Vertical):
1. “healthcare bill”
2. “manchester bombing”
3. “london terrorism”
4. “trump russia”
5. “travel ban”
6. “hurricane harvey”
7. “hurricane irma”
Methodology: Dataset generation, representation, and processing
● Dataset generated by extracting URIs from SERPs for seven queries
● Dataset was semi-automatically generated with the http://www.localmemory.org/ collection generator Chrome extension
● 151,602 URIs (33,432 unique)
Tracking URIs: single query perspective
1: Query issued
2: URIs extracted (scheme and query parameters removed to track URIs)
3: URI info stored in JSON files
4: Date URI was found, page, etc.
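The tracking step above (dropping the scheme and query parameters, then storing one record per observation) could be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the JSON field names, and the example URI are all hypothetical, and dropping the fragment and lowercasing the host are extra assumptions.

```python
import json
from urllib.parse import urlsplit


def uri_key(uri: str) -> str:
    """Return a tracking key: the URI without its scheme and query string.

    Assumption (not from the slides): the fragment is dropped as well,
    and the host is lowercased.
    """
    parts = urlsplit(uri)
    return parts.netloc.lower() + parts.path


def make_record(uri, query, serp, page, date_found):
    """Build one JSON-serializable observation (field names are hypothetical)."""
    return {
        "uri": uri,
        "key": uri_key(uri),
        "query": query,
        "serp": serp,
        "page": page,
        "date_found": date_found,
    }


# Hypothetical observation: a URI seen on page 2 of the News Vertical SERP.
record = make_record(
    "https://www.example.com/news/harvey-update?utm_source=feed",
    query="hurricane harvey",
    serp="News Vertical",
    page=2,
    date_found="2017-08-30",
)
print(json.dumps(record, indent=2))
```

With a key like this, the same story reached over http or https, or with different tracking parameters, counts as one URI across days.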
● URI replacement rate, new URI rate, and page-level new URI rate
● Probability of finding a story
● Distribution of stories over time across pages
● Overlap rate and recall (see paper for details)
Retrievability of URIs was assessed by extracting four measures
Example: URI replacement rate and new URI rate
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,x,y}
○ URI replacement rate at t1 (at t1, c was replaced): |{c}| / |{a,b,c}| = 1/3 ≈ 0.33
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,c,d,e}
○ New URI rate from t0 to t1 (at t1, we saw new URIs d and e): |{d,e}| / |{a,b,c,d,e}| = 2/5 = 0.4
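The two rates in this example can be computed with plain set arithmetic. The sketch below assumes the replacement rate is the share of URIs at t0 that are missing at t1, and the new URI rate is the share of URIs at t1 that were not seen at t0; the function names are illustrative, and the paper remains the reference for the exact definitions.

```python
def uri_replacement_rate(uris_t0, uris_t1):
    """Fraction of the URIs seen at t0 that no longer appear at t1."""
    old, new = set(uris_t0), set(uris_t1)
    if not old:
        return 0.0
    return len(old - new) / len(old)


def new_uri_rate(uris_t0, uris_t1):
    """Fraction of the URIs seen at t1 that were not present at t0."""
    old, new = set(uris_t0), set(uris_t1)
    if not new:
        return 0.0
    return len(new - old) / len(new)


# The two examples from the slide.
print(uri_replacement_rate({"a", "b", "c"}, {"a", "b", "x", "y"}))  # c was replaced
print(new_uri_rate({"a", "b", "c"}, {"a", "b", "c", "d", "e"}))     # d and e are new
```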
Probability of finding a URI over time and distribution of URIs over time across pages
Example with 3 URIs; each cell is the SERP page on which the URI was found (0 = not found within pages 1-5):
                Day 1  Day 2  Day 3  Day 4  Day 5
URI-1           4      2      0      0      0
URI-2           1      2      1      0      0
URI-3           1      1      1      1      0
Probabilities   3/3    3/3    2/3    1/3    0/3
● URI-1 was found on page 4 on Day 1
● On Day 5, URIs 1-3 were NOT found (pages 1-5)
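The probabilities row in this example can be reproduced from the page numbers. This is a minimal sketch that assumes a page value of 0 means the URI was not found on pages 1-5 that day; the data structure and function name are illustrative.

```python
def daily_probabilities(pages_by_uri):
    """For each day, return the fraction of tracked URIs found on any page.

    pages_by_uri maps a URI label to a list of page numbers, one per day,
    where 0 means "not found within the pages that were scraped".
    """
    n_uris = len(pages_by_uri)
    n_days = len(next(iter(pages_by_uri.values())))
    return [
        sum(1 for pages in pages_by_uri.values() if pages[day] > 0) / n_uris
        for day in range(n_days)
    ]


# Page numbers for Day 1 to Day 5, copied from the example.
observations = {
    "URI-1": [4, 2, 0, 0, 0],
    "URI-2": [1, 2, 1, 0, 0],
    "URI-3": [1, 1, 1, 1, 0],
}
print(daily_probabilities(observations))  # 3/3, 3/3, 2/3, 1/3, 0/3 as floats
```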
Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions
General SERP collections had lower new URI rates, thus produced URIs with a longer lifespan than News vertical SERP collections
[Figure: Hurricane Harvey; panels: General SERP, News Vertical SERP]
Probability of finding the same URI with the same query on News vertical SERP after 1 month ≈ 0
[Figure: Hurricane Harvey]
● URIs of some news stories may not be easily discoverable if the query is issued after 1 month:
○ It matters when users search for URIs
The “life span” of URIs is dependent not just on SERPs, but also on topics
The URIs in “hurricane harvey” had a “longer life” than those in “trump russia” due to a lower rate of new URIs
[Figure: General URIs and News Vertical URIs for “hurricane harvey” and “trump russia”]
Results: the URI replacement and new URI rates are strongly dependent on the topic
E.g., Hurricane Harvey’s lower daily avg. URI replacement rate (0.21) and avg. new URI rate (0.21) < Trump Russia (highest daily - monthly avg. URI replacement and avg. new URI rates)
[Table: Average URI replacement rate (column markers: min -, max +)]
[Table: Average new URI rate (column markers: min -, max +)]
The probability of finding the same URI of a news story with the same query decreased with time for both SERPs
● The probability of finding the URI for a news story when the same query is issued one day after it was first observed:
○ 0.34 - 0.44 (General) vs 0.28 - 0.40 (News Vertical)
● After one week:
○ 0.01 - 0.11 (General) vs 0.03 - 0.14 (News Vertical)
● After one month:
○ 0.01 - 0.08 (General) vs 0 (News Vertical)
Generalization of the probability of finding an arbitrary URI as a function of time (days)
● We fitted a curve over the union of occurrence of the URIs in our dataset with an exponential model.
● The probability of finding an arbitrary URI of a news story s on a SERP sp ∈ {General, News Vertical}, after k days is predicted as follows:
[Equation image: fitted exponential model]
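The fitted equation itself is not preserved in this transcript, so the sketch below only illustrates the general technique of fitting an exponential decay to probability-over-days data. It assumes a two-parameter form P(k) = a * exp(-b * k) and fits it by least squares on log(P); both the model form and the fitting method are assumptions rather than the authors' stated procedure, and the data points are synthetic, not values from the study.

```python
import math


def fit_exponential(days, probs):
    """Fit P(k) = a * exp(-b * k) by linear least squares on log(P).

    Assumed model form; all probabilities must be strictly positive.
    Returns the pair (a, b).
    """
    xs = list(days)
    ys = [math.log(p) for p in probs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, k):
    """Predicted probability of finding the URI after k days."""
    return a * math.exp(-b * k)


# Synthetic data generated from a known curve, only to check the fit.
days = list(range(1, 8))
probs = [0.5 * math.exp(-0.3 * k) for k in days]
a, b = fit_exponential(days, probs)
print(round(a, 3), round(b, 3))
```

A log-linear fit keeps the example dependency-free; it cannot use zero probabilities, so real data that reaches 0 would need a different fitting approach.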
URIs show multiple page movement patterns
[Figure: URIs tracked from May 25, 2017 to July 15, 2017]
● Each box represents a URI; the number in a box is the page on which the URI was found, e.g., https://en.wikipedia.org/wiki/Manchester_Arena_bombing (page 1)
● Color codes:
○ Page 1
○ Page 2
○ Page 3
○ Page 4
○ Page 5
○ White (outside pages 1-5)
Rapid/steady rank climbing and falling:
● Rapid climb: some URIs go from page 5 to page 1 (skipping pages 4 - 2),
● Rapid fall: or go from page 1 to page 5,
● Steady climb: or go from page 3 to 2 to 1.
Persistent vs rank climbing/falling page movement patterns
[Figure: URIs tracked from May 25, 2017 to July 15, 2017]
https://en.wikipedia.org/wiki/Manchester_Arena_bombing: Some URIs persist over time within the same page
https://www.rollingstone.com/music/news/manchester-bombing-what-we-know-about-arena-terror-attack-w483752:
Scraping SERPs for seeds?
Start early, don’t stop!
Conclusions
● Collection building offers a way of preserving the historic record of important events and begins with seeds.
● Search engines provide an opportunity to build collections or extract seeds, but tend to provide the most recent documents.
● Our findings about the difficulty in “refinding” news stories suggest that collection building efforts that utilize SERPs should start early and persist.
@acnwala @webscidl
Thank you!
Access our research dataset of 151,602 (33,432 unique) links extracted from
the Google SERPs for over seven months:
https://github.com/anwala/SERPRefind
@acnwala
Scraping SERPs for Archival Seeds: It Matters When You Start
JCDL 2018 • June 5, 2018

More Related Content

Similar to Scraping SERPs For Archival Seeds - It Matters When You Start

Big Data and Data Science: Opportunities for Biomedical Engineering
Big Data and Data Science: Opportunities for Biomedical EngineeringBig Data and Data Science: Opportunities for Biomedical Engineering
Big Data and Data Science: Opportunities for Biomedical Engineering
Philip Bourne
 
Developing a self-care protocol for working with potentially traumatic data: ...
Developing a self-care protocol for working with potentially traumatic data: ...Developing a self-care protocol for working with potentially traumatic data: ...
Developing a self-care protocol for working with potentially traumatic data: ...
dri_ireland
 
An Example of a Lean Startup: The Past, Present and Future
An Example of a Lean Startup: The Past, Present and FutureAn Example of a Lean Startup: The Past, Present and Future
An Example of a Lean Startup: The Past, Present and Future
Sean Ekins
 
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
Department of Geography, University of Kentucky
 
Big Data Analytics and Open Data
Big Data Analytics and Open Data Big Data Analytics and Open Data
Big Data Analytics and Open Data
Sharjeel Imtiaz
 
Big Data — Your new best friend
Big Data — Your new best friendBig Data — Your new best friend
Big Data — Your new best friend
Reuven Lerner
 
Innovative project1
Innovative project1Innovative project1
Innovative project1
LillySheebaS1
 
Data Journalism 101: A Brief Survey
Data Journalism 101: A Brief SurveyData Journalism 101: A Brief Survey
Data Journalism 101: A Brief Survey
Flex.io
 
[AgBioData] Genome nomenclature 09-05-2018
[AgBioData] Genome nomenclature 09-05-2018[AgBioData] Genome nomenclature 09-05-2018
[AgBioData] Genome nomenclature 09-05-2018
AgBioData
 
Enabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesEnabling Personal Use of Web Archives
Enabling Personal Use of Web Archives
Michele Weigle
 
Lessons learnt from the GCP experience – J-M Ribaut
Lessons learnt from the GCP experience – J-M RibautLessons learnt from the GCP experience – J-M Ribaut
Lessons learnt from the GCP experience – J-M Ribaut
CGIAR Generation Challenge Programme
 
First annual scientific conference - overview
First annual scientific conference - overviewFirst annual scientific conference - overview
First annual scientific conference - overview
IFPRI-PIM
 
First annual scientific conference - overview
First annual scientific conference - overviewFirst annual scientific conference - overview
First annual scientific conference - overview
CGIAR
 
Big Data and Data Science W's
Big Data and Data Science W'sBig Data and Data Science W's
Big Data and Data Science W's
Emanuele Della Valle
 
Ouellette elixir 2017
Ouellette elixir 2017Ouellette elixir 2017
Ouellette elixir 2017
Neuro, McGill University
 
NCI Support for Cancer Data Sharing
NCI Support for Cancer Data SharingNCI Support for Cancer Data Sharing
NCI Support for Cancer Data Sharing
Warren Kibbe
 
What Does Open Data Mean to Data Science
What Does Open Data Mean to Data ScienceWhat Does Open Data Mean to Data Science
What Does Open Data Mean to Data Science
Philip Bourne
 
Arts Council England Environmental Reporting 2015/16: Grant Holders
Arts Council England Environmental Reporting 2015/16: Grant HoldersArts Council England Environmental Reporting 2015/16: Grant Holders
Arts Council England Environmental Reporting 2015/16: Grant Holders
Julie's Bicycle
 
Better Data for a Better World
Better Data for a Better WorldBetter Data for a Better World
Better Data for a Better World
Rothamsted Research, UK
 

Similar to Scraping SERPs For Archival Seeds - It Matters When You Start (20)

Big Data and Data Science: Opportunities for Biomedical Engineering
Big Data and Data Science: Opportunities for Biomedical EngineeringBig Data and Data Science: Opportunities for Biomedical Engineering
Big Data and Data Science: Opportunities for Biomedical Engineering
 
Developing a self-care protocol for working with potentially traumatic data: ...
Developing a self-care protocol for working with potentially traumatic data: ...Developing a self-care protocol for working with potentially traumatic data: ...
Developing a self-care protocol for working with potentially traumatic data: ...
 
An Example of a Lean Startup: The Past, Present and Future
An Example of a Lean Startup: The Past, Present and FutureAn Example of a Lean Startup: The Past, Present and Future
An Example of a Lean Startup: The Past, Present and Future
 
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
Bowdoin: Data Driven Socities 2014 - On Digital Publics of Opening…or Not 2/1...
 
Big Data Analytics and Open Data
Big Data Analytics and Open Data Big Data Analytics and Open Data
Big Data Analytics and Open Data
 
Big Data — Your new best friend
Big Data — Your new best friendBig Data — Your new best friend
Big Data — Your new best friend
 
Innovative project1
Innovative project1Innovative project1
Innovative project1
 
Data Journalism 101: A Brief Survey
Data Journalism 101: A Brief SurveyData Journalism 101: A Brief Survey
Data Journalism 101: A Brief Survey
 
[AgBioData] Genome nomenclature 09-05-2018
[AgBioData] Genome nomenclature 09-05-2018[AgBioData] Genome nomenclature 09-05-2018
[AgBioData] Genome nomenclature 09-05-2018
 
Enabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesEnabling Personal Use of Web Archives
Enabling Personal Use of Web Archives
 
Lessons learnt from the GCP experience – J-M Ribaut
Lessons learnt from the GCP experience – J-M RibautLessons learnt from the GCP experience – J-M Ribaut
Lessons learnt from the GCP experience – J-M Ribaut
 
First annual scientific conference - overview
First annual scientific conference - overviewFirst annual scientific conference - overview
First annual scientific conference - overview
 
First annual scientific conference - overview
First annual scientific conference - overviewFirst annual scientific conference - overview
First annual scientific conference - overview
 
Big Data and Data Science W's
Big Data and Data Science W'sBig Data and Data Science W's
Big Data and Data Science W's
 
Ouellette elixir 2017
Ouellette elixir 2017Ouellette elixir 2017
Ouellette elixir 2017
 
NCI Support for Cancer Data Sharing
NCI Support for Cancer Data SharingNCI Support for Cancer Data Sharing
NCI Support for Cancer Data Sharing
 
What Does Open Data Mean to Data Science
What Does Open Data Mean to Data ScienceWhat Does Open Data Mean to Data Science
What Does Open Data Mean to Data Science
 
Arts Council England Environmental Reporting 2015/16: Grant Holders
Arts Council England Environmental Reporting 2015/16: Grant HoldersArts Council England Environmental Reporting 2015/16: Grant Holders
Arts Council England Environmental Reporting 2015/16: Grant Holders
 
Better Data for a Better World
Better Data for a Better WorldBetter Data for a Better World
Better Data for a Better World
 
Standing Panel on Impact Assessment Doug Gollin
Standing Panel on Impact Assessment Doug GollinStanding Panel on Impact Assessment Doug Gollin
Standing Panel on Impact Assessment Doug Gollin
 

More from Alexander Nwala

Local Memory Project
Local Memory ProjectLocal Memory Project
Local Memory Project
Alexander Nwala
 
Tweet Visibility Dynamics in a Tweet Conversation Graph
Tweet Visibility Dynamics in a Tweet Conversation GraphTweet Visibility Dynamics in a Tweet Conversation Graph
Tweet Visibility Dynamics in a Tweet Conversation Graph
Alexander Nwala
 
Generating collections for stories and events
Generating collections for stories and eventsGenerating collections for stories and events
Generating collections for stories and events
Alexander Nwala
 
Jcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankovaJcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankova
Alexander Nwala
 
Tracking discourse on social media
Tracking discourse on social mediaTracking discourse on social media
Tracking discourse on social media
Alexander Nwala
 
Information Visualization Project
Information Visualization ProjectInformation Visualization Project
Information Visualization Project
Alexander Nwala
 

More from Alexander Nwala (6)

Local Memory Project
Local Memory ProjectLocal Memory Project
Local Memory Project
 
Tweet Visibility Dynamics in a Tweet Conversation Graph
Tweet Visibility Dynamics in a Tweet Conversation GraphTweet Visibility Dynamics in a Tweet Conversation Graph
Tweet Visibility Dynamics in a Tweet Conversation Graph
 
Generating collections for stories and events
Generating collections for stories and eventsGenerating collections for stories and events
Generating collections for stories and events
 
Jcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankovaJcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankova
 
Tracking discourse on social media
Tracking discourse on social mediaTracking discourse on social media
Tracking discourse on social media
 
Information Visualization Project
Information Visualization ProjectInformation Visualization Project
Information Visualization Project
 

Recently uploaded

Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 

Recently uploaded (20)

Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
E-commerce Application Development Company.pdf

Scraping SERPs For Archival Seeds - It Matters When You Start

  • 1. Scraping SERPs for Archival Seeds: It Matters When You Start
  • 2. Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson. Old Dominion University, Web Science & Digital Libraries Research Group. @acnwala • @WebSciDL. Joint Conference on Digital Libraries (JCDL), June 5, 2018, Fort Worth, TX. This work was made possible in part by IMLS LG-71-15-0077-15. Thank you SIGIR for the Travel Grant.
  • 3. Outline: 1. Introduction and Motivation 2. Research questions 3. Methodology (a. Dataset generation, representation, and processing; b. Primitive measures extraction) 4. Results 5. Conclusions
  • 4. In March 2014, there was a serious outbreak of Ebola in West Africa. The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths [1]. [1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/ http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
  • 5. A few months after the Ebola outbreak, an archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak. http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
  • 6. Sample of Archive-It Ebola virus seeds:
    http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
    http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
    http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
    http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
    http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
    http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
    http://www.cdc.gov/mmwr/ebola_reports.html
    http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
    http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
    http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
    http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
    http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
    http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
    http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
    http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
    http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
    http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
    http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
    http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
    http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
    http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
    http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
    http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
    http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
    http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
    http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
    http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
    http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
    http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
    http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
    http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
  • 7. Archived web collections begin with seeds
    ● A seed list is an initial collection of exemplar web pages for a topic
      ○ seeds + linked pages form a collection when crawled
    ● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.”
    ● Human-generated seeds are high-quality, but expensive to generate
  • 8. Archived web collections offer a way of preserving the historic record of important events. Examples: Mandela’s legacy (http://xhosaculture.co.za/), 2016 Dakota Access Pipeline (https://www.wsj.com/), 2018 Winter Olympics (http://www.nj.com/)
  • 9. Seeds may be generated by multiple users
    ● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs.
  • 10. Sample seeds contributed for the Boston Marathon Bombing Collection
  • 11. Tweet requests for other collections
    1. SOPA blackout (Jan 2012)
    2. Hurricane Sandy (Aug 2012)
    3. 2012 Occupy movement (May 2012)
    4. Aaron Swartz (Jan 2013)
    5. Supreme Court hearings DOMA (Mar 2013)
    6. Boston Marathon Bombing (Apr 2013)
    7. Nelson Mandela (Dec 2013)
    8. 2014 Ferguson, MO (Aug 2014)
    9. Ebola virus (Oct 2014)
    10. 2016 U.S. presidential election (Nov 2016)
    11. #DAPL (Dec 2016)
    12. 2018 Winter Olympics (Feb 2018)
  • 12. Collection building often begins with a simple Google search
    ● Seeds can be discovered by issuing queries (e.g., “hurricane harvey”) to Google and extracting URIs from the SERP (Search Engine Result Page)
    ● URIs for older news stories may be difficult to discover via Google after one month (research result)
  • 13. Before extracting seed URIs of news stories from SERPs, we investigated re-finding URIs on Google.
  • 14. All (renamed General) SERP vs. News vertical SERP
    ● All (renamed General) SERP: Wikipedia page present; older documents
    ● News vertical SERP: Wikipedia page absent; newer documents (e.g., 2 hours old)
  • 15. Depending on query/topic, new pages displace older pages in SERPs: query = "healthcare bill"
    ● SERP on May 25, 2017 (initial stages of the AHCA bill and the struggles to pass the bill):
      ○ Healthcare saga shaping GOP approach to tax bill (thehill.com)
      ○ US Senate’s McConnell sees tough path for passing healthcare bill (cnbc.com)
      ○ Will the Republican Health Care Bill Really Lower Premiums? (time.com)
    ● SERP on January 5, 2018, 7 months later (later stages of the AHCA bill and the failure of the bill, which happened in September 2017):
      ○ House Republicans used lessons from failed health care bill to pass tax reform, Ryan says (pbs.org)
      ○ GOP tax bill also manages to needlessly screw up the healthcare system (latimes.com)
      ○ How GOP tax bill’s Obamacare changes will affect health care and consumers (chicagotribune.com)
  • 16. Outline: 1. Introduction and Motivation 2. Research questions 3. Methodology (a. Dataset generation, representation, and processing; b. Primitive measures extraction) 4. Results 5. Conclusions
  • 17. Primary research questions
    ● RQ1: What is the rate at which new URIs replace old URIs on the SERP over time?
    ● RQ2: What is the probability of finding the same URI with the same query over time?
  • 18. Outline: 1. Introduction and Motivation 2. Research questions 3. Methodology (a. Dataset generation, representation, and processing; b. Primitive measures extraction) 4. Results 5. Conclusions
  • 19. Methodology: Dataset generation, representation, and processing
    ● For a 7-month period (May 25, 2017 to January 12, 2018), we issued 7 queries every day and extracted URIs from the first 5 SERPs (General & News Vertical):
      1. “healthcare bill”
      2. “manchester bombing”
      3. “london terrorism”
      4. “trump russia”
      5. “travel ban”
      6. “hurricane harvey”
      7. “hurricane irma”
  • 20. 151,602 URIs (33,432 unique)
    ● The dataset was generated by extracting URIs from SERPs for seven queries
    ● The dataset was semi-automatically generated with the http://www.localmemory.org/ collection generator Chrome extension
  • 21. Tracking URIs: single query perspective
    1: Query issued
    2: URIs extracted (scheme and query parameters removed to track URIs)
    3: URI info stored in JSON files
    4: Date URI was found, page, etc.
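The tracking step on slide 21 (removing the scheme and query parameters so the same story can be followed across days) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `normalize_uri` and `make_record`, the record fields, and the choice to lowercase the host are all assumptions.

```python
import json
from urllib.parse import urlsplit


def normalize_uri(uri: str) -> str:
    """Return host + path, dropping the scheme, query string, and fragment.

    Lowercasing the host is an extra assumption made here, not something
    stated on the slide.
    """
    parts = urlsplit(uri.strip())
    return parts.netloc.lower() + parts.path


def make_record(uri: str, query: str, date: str, page: int) -> dict:
    """Build one JSON-serializable observation (field names are hypothetical)."""
    return {
        "uri": uri,
        "key": normalize_uri(uri),  # key used to match the URI across days
        "query": query,
        "date": date,
        "page": page,
    }


if __name__ == "__main__":
    rec = make_record(
        "https://en.wikipedia.org/wiki/Manchester_Arena_bombing",
        "manchester bombing",
        "2017-05-25",
        1,
    )
    print(json.dumps(rec, indent=2))
```

With this normalization, the `http` and `https` variants of a URI, with or without tracking parameters, map to the same key.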
  • 22. Retrievability of URIs was assessed by extracting four measures
    ● URI replacement rate, new URI rate, and page-level new URI rate
    ● Probability of finding a story
    ● Distribution of stories over time across pages
    ● Overlap rate and recall (see paper for details)
  • 23. Example: URI replacement rate and new URI rate
    ● SERP at t0: URIs {a,b,c}; SERP at t1: URIs {a,b,x,y}
      ○ URI replacement rate at t1 (at t1, c was replaced): (formula shown on slide)
    ● SERP at t0: URIs {a,b,c}; SERP at t1: URIs {a,b,c,d,e}
      ○ New URI rate from t0 to t1 (at t1, we saw new URIs d and e): (formula shown on slide)
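The formulas on slide 23 were images and are not part of this transcript. The sketch below assumes one plausible reading consistent with the slide's wording: the replacement rate is the fraction of the earlier SERP's URIs that are missing from the later SERP, and the new URI rate is the fraction of the later SERP's URIs that were not on the earlier one. Both definitions and the function names are assumptions; the paper should be consulted for the authors' exact definitions.

```python
def uri_replacement_rate(prev: set, curr: set) -> float:
    """Assumed definition: share of prev's URIs that no longer appear in curr."""
    if not prev:
        return 0.0
    return len(prev - curr) / len(prev)


def new_uri_rate(prev: set, curr: set) -> float:
    """Assumed definition: share of curr's URIs that were not present in prev."""
    if not curr:
        return 0.0
    return len(curr - prev) / len(curr)


if __name__ == "__main__":
    # The two examples from the slide.
    print(uri_replacement_rate({"a", "b", "c"}, {"a", "b", "x", "y"}))
    print(new_uri_rate({"a", "b", "c"}, {"a", "b", "c", "d", "e"}))
```

Under these assumed definitions, the slide's first example gives 1/3 (only c is gone) and the second gives 2/5 (d and e are new among five URIs).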
  • 24. Probability of finding a URI over time and distribution of URIs over time across pages (3 URIs)
                     Day 1  Day 2  Day 3  Day 4  Day 5
      URI-1            4      2      0      0      0
      URI-2            1      2      1      0      0
      URI-3            1      1      1      1      0
      Probabilities   3/3    3/3    2/3    1/3    0/3
    Each cell gives the page on which the URI was found (e.g., URI-1 was found on page 4 on Day 1); 0 means the URI was not found on pages 1-5 (e.g., URIs 1-3 on Day 5).
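The probabilities row on slide 24 can be reproduced by counting, for each day, how many of the tracked URIs were found on any of pages 1-5. A minimal sketch using the slide's own table follows; the dictionary layout and the function name `daily_find_probability` are illustrative assumptions.

```python
from fractions import Fraction

# Page on which each URI was found per day; 0 means not found on pages 1-5.
pages_by_day = {
    "URI-1": [4, 2, 0, 0, 0],
    "URI-2": [1, 2, 1, 0, 0],
    "URI-3": [1, 1, 1, 1, 0],
}


def daily_find_probability(table: dict) -> list:
    """For each day, return the fraction of URIs with a non-zero page number."""
    rows = list(table.values())
    total = len(rows)
    # zip(*rows) walks the table column by column, i.e. day by day.
    return [Fraction(sum(1 for page in day if page > 0), total) for day in zip(*rows)]


if __name__ == "__main__":
    for day, prob in enumerate(daily_find_probability(pages_by_day), start=1):
        print(f"Day {day}: {prob}")
```

`Fraction` reduces its values, so 3/3 prints as 1 and 0/3 as 0; the values match the slide's row 3/3, 3/3, 2/3, 1/3, 0/3.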
  • 25. Outline: 1. Introduction and Motivation 2. Research questions 3. Methodology (a. Dataset generation, representation, and processing; b. Primitive measures extraction) 4. Results 5. Conclusions
  • 26. General SERP collections had lower new URI rates, and thus produced URIs with a longer lifespan than News vertical SERP collections. Figure panels: Hurricane Harvey, General SERP vs. News Vertical SERP.
  • 27. Probability of finding the same URI with the same query on the News vertical SERP after 1 month ≈ 0 (figure: Hurricane Harvey)
    ● URIs of some news stories may not be easily discoverable if the query is issued after 1 month
      ○ It matters when users search for URIs
  • 28. The “life span” of URIs depends not just on SERPs, but also on topics. The URIs in “hurricane harvey” had a “longer life” than those in “trump russia” due to a lower rate of new URIs. Figure panels: General URIs and News Vertical URIs for hurricane harvey and trump russia.
  • 29. Results: the URI replacement and new URI rates are strongly dependent on the topic. E.g., Hurricane Harvey had a lower daily average URI replacement rate (0.21) and average new URI rate (0.21) than Trump Russia, which had the highest average URI replacement and new URI rates from the daily through the monthly level. Tables: Average URI replacement rate (column markers: min-, max+); Average new URI rate (column markers: min-, max+).
  • 30. The probability of finding the same URI of a news story with the same query decreased with time for both SERPs
    ● One day after the URI was first observed: 0.34 - 0.44 (General) vs 0.28 - 0.40 (News Vertical)
    ● After one week: 0.01 - 0.11 (General) vs 0.03 - 0.14 (News Vertical)
    ● After one month: 0.01 - 0.08 (General) vs 0 (News Vertical)
  • 31. Generalization of the probability of finding an arbitrary URI as a function of time (days)
    ● We fitted a curve over the union of occurrence of the URIs in our dataset with an exponential model.
    ● The probability of finding an arbitrary URI of a news story s on a SERP sp ∈ {General, News Vertical} after k days is predicted as follows: (formula shown on slide)
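The fitted equation and its coefficients on slide 31 are not part of this transcript. As a rough illustration of what fitting an exponential model to such probabilities can look like, the sketch below assumes the form p(k) = a·e^(−b·k) and estimates a and b by ordinary least squares on ln p. Both the functional form and the fitting method are assumptions and may differ from the authors' procedure; `fit_exponential` and `predict` are hypothetical names.

```python
import math


def fit_exponential(days, probs):
    """Fit p(k) = a * exp(-b * k) by least squares on ln(p); returns (a, b).

    Observations with p <= 0 are skipped because ln(p) is undefined there.
    """
    points = [(k, math.log(p)) for k, p in zip(days, probs) if p > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive probabilities")
    n = len(points)
    mean_k = sum(k for k, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct day values")
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return math.exp(intercept), -slope


def predict(a: float, b: float, k: float) -> float:
    """Predicted probability of finding a URI k days after it was first seen."""
    return a * math.exp(-b * k)
```

A log-linear fit is the simplest option; it weights small probabilities more heavily than a direct nonlinear fit would, which matters when many late-day probabilities are near zero.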
  • 32. URIs show multiple page movement patterns
    ● Each box represents a URI; numbers in boxes represent the page on which the URI was found, e.g., https://en.wikipedia.org/wiki/Manchester_Arena_bombing (page 1)
    ● Color codes: Page 1, Page 2, Page 3, Page 4, Page 5, White (outside pages 1-5)
    ● Figure axes: URIs vs. time (May 25, 2017 to July 15, 2017)
  • 33. Persistent vs rank climbing/falling page movement patterns (figure axes: URIs vs. time, May 25, 2017 to July 15, 2017)
    ● Persistent: some URIs persist over time within the same page, e.g., https://en.wikipedia.org/wiki/Manchester_Arena_bombing
    ● Rapid/steady rank climbing and falling, e.g., https://www.rollingstone.com/music/news/manchester-bombing-what-we-know-about-arena-terror-attack-w483752
      ○ Rapid climb: some URIs go from page 5 to 1 (skipping pages 4 - 2)
      ○ Rapid fall: or go from 1 to 5
      ○ Steady fall: or go from 3 - 2 - 1
  • 34. Scraping SERPs for seeds? Start early, don’t stop!
  • 35. Conclusions
    ● Collection building offers a way of preserving the historic record of important events and begins with seeds.
    ● Search engines provide an opportunity to build collections or extract seeds, but tend to provide the most recent documents.
    ● Our findings about the difficulty in “refinding” news stories suggest that collection building efforts that utilize SERPs should start early and persist.
    Thank you! @acnwala @webscidl
    Access our research dataset of 151,602 (33,432 unique) links extracted from the Google SERPs for over seven months: https://github.com/anwala/SERPRefind