Creating Topical Collections: Web Archives vs. Live Web
@mart1nkle1n
CNI Fall Meeting 2017, 12/11/2017, Washington, DC
Creating Topical Collections:
Web Archives vs. Live Web
Martin Klein
@mart1nkle1n
Research Library
Los Alamos National Laboratory
Team Work
Lyudmila Balakireva
Herbert Van de Sompel
@hvdsomp
• Live web is dynamic, lives in a “perpetual now”
• Subject to link rot and content drift ( = reference rot)*
• Significant platform/source for news publication/consumption
Background - Live Web
http://archive.is/FhdK6
• Pew Research Center survey from August 2017:
• 43% often get news online (38% in early 2016)
• 50% often get news from TV (57% in early 2016)
*See:
https://doi.org/10.1371/journal.pone.0115253
https://doi.org/10.1371/journal.pone.0167475
• Often orchestrated by subject matter experts, archivists,
special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc)
• Utilization of crawling services such as Archive-It, Social Feed
Manager
• Relevance of seeds assessed by humans
• Time passed since event is a concern because:
• Stories evolve
• Reference rot
• API restrictions
Background – Collection Building from the Live Web
• Web archives are an invaluable resource for researchers,
historians, journalists, etc.
• Often broad in scope, large in scale, covering different
temporal intervals
• Makes discovery, access, and analysis difficult
• In particular, for topic-specific resources
Background – Archived Web
Memento allows access to many web archives simultaneously!
Access to the Archived Web
http://timetravel.mementoweb.org/
http://mementoweb.org/guide/rfc/
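A minimal sketch of how a client might ask for an archived version close to a given date, following the datetime negotiation described in RFC 7089. The TimeGate prefix below is assumed to be the aggregator endpoint of the Time Travel service; the function name and the example URI are illustrative only, and the request is prepared but not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: aggregator TimeGate prefix of the Time Travel service.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri, when):
    """Return (url, headers) for a Memento datetime-negotiation request."""
    # RFC 7089 carries the desired datetime in the Accept-Datetime header,
    # formatted as an HTTP date in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + uri, {"Accept-Datetime": http_date}


url, headers = build_timegate_request(
    "http://www.cnn.com/", datetime(2017, 10, 31, 12, 0, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

Any HTTP client can send the resulting request; a TimeGate is expected to redirect to the memento closest to the requested datetime, whichever archive holds it.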
<Intermezzo>
Web Crawling
Focused Crawling
[Figure: crawl tree with the seed at the root, Child 1 to Child 3 below it, and their children below those; legend: not crawled, crawled and not relevant, crawled and relevant]
</Intermezzo>
Inspiration from Previous Work
https://doi.org/10.1007/978-3-319-67008-9_10
• Extract event-centric document collections via focused
crawling of an archive
• Archive = web pages from .de top-level domain, captured by
the Internet Archive until 2013 (30 TB, 4 billion captures, 1 billion URIs)
• Identified 28 topics, likely covered in archive
• Text of topics’ Wikipedia page used for content relevance
evaluation
• Crawled page datetime used for temporal relevance
evaluation
• Overall relevance = content relevance + temporal relevance
• Wikipedia page outlinks used as seeds for focused crawl
Previous Work - Setup
Previous Work – Results
• Can we create high-quality topical collections by focused
crawling online-available web archives?
• What is the effect of including multiple archives in the crawl?
• How do collections created from the archived web compare to
those created from the live web?
• How does the amount of time passed since the event affect
the quality of the collection?
Our Questions
• Topics limited to terror attacks and mass shootings in the U.S.
• From different times in the past
• Focused crawl of:
• 22 archives, simultaneously, via Memento infrastructure
• the live web
• Take content and temporal relevance into account
• Equally weighted: R = (0.5 × CR) + (0.5 × TR), with CR = content relevance and TR = temporal relevance
• Use events’ Wikipedia page as input for focused crawler
Our Experiment
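The equal weighting of content and temporal relevance can be written as a small function. This is a sketch only; the function and parameter names are illustrative, and the adjustable weight is an addition for experimentation (the slide fixes it at 0.5).

```python
def overall_relevance(content_relevance, temporal_relevance, content_weight=0.5):
    """R = (w x CR) + ((1 - w) x TR); the slide uses w = 0.5."""
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must be between 0 and 1")
    temporal_weight = 1.0 - content_weight
    return content_weight * content_relevance + temporal_weight * temporal_relevance


# A page with moderate topical overlap, published inside the relevant interval.
print(overall_relevance(0.5, 1.0))
```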
1. Content of Wikipedia page + random 60% of page’s references
• Generate topic vector (TF-IDF of 1grams + 2grams)
2. Content of remaining 40% of Wikipedia page’s outlinks
• Generate topic vector (TF-IDF of 1grams + 2grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average similarity value as content threshold
Content Relevance
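The threshold procedure above could be sketched roughly as follows, using only the standard library. Several details are assumptions rather than the authors' implementation: the tokenizer, the smoothed IDF computed over just the two groups of text, and all function names. Fetching the Wikipedia page and its references is left out; the texts are passed in as strings.

```python
import math
import random
import re
from collections import Counter


def ngrams(text):
    """Lowercased 1-grams and 2-grams of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(groups):
    """One TF-IDF vector (term -> weight) per group of documents."""
    counts = [Counter(t for doc in group for t in ngrams(doc)) for group in groups]
    n = len(counts)
    doc_freq = Counter(t for c in counts for t in c)
    # Assumption: smoothed IDF, so terms present in both groups keep a weight.
    return [
        {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(page_text, reference_texts, runs=10, share=0.6, seed=0):
    """Average similarity between (page + 60% of references) and the other 40%."""
    rng = random.Random(seed)
    similarities = []
    for _ in range(runs):
        shuffled = list(reference_texts)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * share)
        first, second = tfidf_vectors([[page_text] + shuffled[:cut], shuffled[cut:]])
        similarities.append(cosine(first, second))
    return sum(similarities) / len(similarities)
```

A crawled page would then count as content-relevant when its similarity to the topic vector is at least the returned threshold.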
• Define temporal interval for which crawled pages are
considered relevant
• Event date extracted from Wikipedia event page
• Change point determined from graph of proportional
Wikipedia page edits per day
Temporal Relevance
[Figure: timeline from Event Date via Change Point to Today, annotated with relevance values 1 and 0]
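One possible reading of the timeline is a step function: a page is temporally relevant (1) if its datetime falls between the event date and the change point, and not relevant (0) otherwise. The sketch below assumes exactly that; the original work may use a different shape, and the change point date in the example is hypothetical.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """1.0 inside [event_date, change_point], else 0.0 (assumed step function)."""
    return 1.0 if event_date <= page_date <= change_point else 0.0


event = date(2016, 6, 12)         # Orlando event date from the slides
change_point = date(2016, 7, 15)  # hypothetical value, for illustration only
print(temporal_relevance(date(2016, 6, 20), event, change_point))
print(temporal_relevance(date(2017, 1, 1), event, change_point))
```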
Change Point Detection
[Figure: percentage of Wikipedia page edits (0 to 100) over edit dates from 2016-06-12 to 2017-08-24; the value 46 is marked]
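The slides do not state which change point algorithm was used. As a stand-in, the sketch below computes each day's share of all page edits and picks the first day on which the running share reaches a cut-off. Both the cut-off rule and the 90% default are assumptions chosen for illustration, not the authors' method.

```python
from collections import Counter
from datetime import date


def edit_share_per_day(edit_dates):
    """Map each day (in date order) to its percentage of all edits."""
    per_day = Counter(edit_dates)
    total = sum(per_day.values())
    return {day: 100.0 * count / total for day, count in sorted(per_day.items())}


def change_point_by_cutoff(edit_dates, cutoff_percent=90.0):
    """First day on which the running share of edits reaches the cut-off."""
    if not edit_dates:
        raise ValueError("at least one edit date is required")
    running = 0.0
    for day, share in edit_share_per_day(edit_dates).items():
        running += share
        if running >= cutoff_percent:
            return day
    return max(edit_dates)  # floating-point shortfall: fall back to the last day


# Hypothetical edit history: most edits happen right after the event.
edits = [date(2016, 6, 12)] * 6 + [date(2016, 6, 13)] * 3 + [date(2016, 7, 1)]
print(change_point_by_cutoff(edits))
```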
• Extract datetime from pages via:
• URI
http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
• Meta tags
<meta property="article:published" itemprop="datePublished"
content="2017-12-09T10:14:50-05:00" />
• ODU’s Carbondate tool
http://carbondate.cs.odu.edu/
• Memento datetime
• X-Header
Datetime Extraction
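The first two extraction routes can be sketched with the standard library. The URI pattern and the rule for recognizing a publication meta tag are assumptions modeled on the two examples above; real pages use many other conventions, which is presumably why further sources such as Carbondate and the Memento datetime are listed.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Assumption: dates appear as /YYYY/MM/DD/ path segments, as in the CNN example.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def datetime_from_uri(uri):
    """Return the date embedded in the URI path, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    try:
        return datetime(*map(int, match.groups()))
    except ValueError:  # digits that do not form a valid date
        return None


class MetaDateParser(HTMLParser):
    """Remember the first <meta> tag that looks like a publication date."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.found is not None:
            return
        attributes = dict(attrs)
        label = f"{attributes.get('property') or ''} {attributes.get('itemprop') or ''}"
        content = attributes.get("content")
        if "published" in label.lower() and content:
            try:
                self.found = datetime.fromisoformat(content)
            except ValueError:
                pass  # not an ISO 8601 value; keep looking


def datetime_from_meta(html):
    """Return the publication datetime found in meta tags, or None."""
    parser = MetaDateParser()
    parser.feed(html)
    return parser.found


print(datetime_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/"))
```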
• Use version of Wikipedia page that was live at change point
• Possible crawl stop conditions:
• Total number of documents crawled
• Accumulated size of crawled documents
• Time elapsed since crawl started
• Crawl x levels deep
• No more relevant documents left
• Our pick:
• Crawl relevant documents
• 5 levels deep
• with priority queue
Crawls
[Figure: crawl tree with the seed at Level 0, Child 1 to Child 3 at Level 1, and their children at Level 2]
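The chosen crawl strategy (follow only relevant documents, at most 5 levels deep, ordered by a priority queue) could look roughly like the sketch below. It runs over an in-memory link graph; `get_links` and `score` are hypothetical callbacks standing in for fetching a page's outlinks and computing its relevance R, and scoring a page when it is queued is a simplification.

```python
import heapq


def focused_crawl(seed, get_links, score, threshold, max_depth=5):
    """Return {uri: relevance} for relevant pages reachable within max_depth."""
    # heapq is a min-heap, so relevance is negated to pop the best page first.
    queue = [(-score(seed), 0, seed)]
    seen = {seed}
    relevant = {}
    while queue:
        negated, depth, uri = heapq.heappop(queue)
        relevance = -negated
        if relevance < threshold:
            continue  # crawled and not relevant: its outlinks are not followed
        relevant[uri] = relevance
        if depth == max_depth:
            continue  # depth limit reached: keep the page, stop expanding
        for child in get_links(uri):
            if child not in seen:
                seen.add(child)
                heapq.heappush(queue, (-score(child), depth + 1, child))
    return relevant


# Hypothetical toy graph and relevance scores.
links = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
scores = {"seed": 1.0, "a": 0.9, "b": 0.2, "a1": 0.7, "b1": 0.95}
print(focused_crawl("seed", links.__getitem__, scores.__getitem__, threshold=0.5))
```

In this toy graph the page b1 is never reached, because its parent b falls below the threshold.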
• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009
Collections Crawled (in late November)
NYC, 10/31/2017 – URIs per Level
[Figure: two panels, Archived Crawl and Live Crawl; x-axis: Levels (0 to 5); series: All URIs and Relevant URIs]
Intermezzo – Focused Crawling
[Figure: crawl tree with the seed at the root, Child 1 to Child 3 below it, and their children below those; legend: not crawled, crawled and not relevant, crawled and relevant]
NYC, 10/31/2017 – Relevance over URIs
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived and Live]
NYC, 10/31/2017 – Relevance over Crawl Time
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds (0 to 200000); y-axis: Accumulated Relevance; series: Archived and Live]
NYC, 10/31/2017 – Web Archive Distribution
[Figure: bar chart of All Mementos and Relevant Mementos per archive: web.archive.org, wayback.archive-it.org, archive.is, perma-archives.org, webarchive.nationalarchives.gov.uk]
Binghamton, April 3rd 2009 – URIs per Level
[Figure: two panels, Archived Crawl (levels 0 to 5) and Live Crawl (levels 0 to 4); x-axis: Levels; series: All URIs and Relevant URIs]
Binghamton, April 3rd 2009 – Relevance over URIs
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived and Live]
Binghamton, April 3rd 2009 – Relevance over Crawl Time
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds (0 to 250000); y-axis: Accumulated Relevance; series: Archived and Live]
Binghamton, April 3rd 2009 – Web Archive Distribution
[Figure: bar chart of All Mementos and Relevant Mementos per archive: web.archive.org, webarchive.loc.gov, wayback.archive-it.org, arquivo.pt, swap.stanford.edu, archive.is, web.archive.bibalex.org:80]
San Bernardino, December 2nd 2015 – URIs per Level
[Figure: two panels, Archived Crawl and Live Crawl; x-axis: Levels (0 to 5); series: All URIs and Relevant URIs]
San Bernardino, December 2nd 2015 – Relevance over URIs
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived and Live]
San Bernardino, December 2nd 2015 – Relevance over Crawl Time
[Figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds (0 to 100000); y-axis: Accumulated Relevance; series: Archived and Live]
San Bernardino, December 2nd 2015 – Web Archive Distribution
[Figure: bar chart of All Mementos and Relevant Mementos per archive: web.archive.org, wayback.archive-it.org, webarchive.loc.gov, archive.is, arquivo.pt, wayback.vefsafn.is, collection.europarchive.org, perma-archives.org, digital.library.yorku.ca, webarchive.org.uk]
• Web archives are good resources to build topical collections
of web resources
• Utilizing multiple web archives is beneficial for the collection
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the
live web than the archived web
but
• Collections about events from the distant past benefit more
from archives than the live web
but
• Collections about less recent events can (still) benefit from the
live web and (already) from the archived web
Take-Aways
• Forgive one level of “irrelevance”
• Compare with manually curated collections (from Archive-It)
• Diversify to international topics and beyond shootings
• Investigate questions of optimal start and end time of crawls
Where to go next
https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384
Creating Topical Collections: Web Archives vs. Live Web
@mart1nkle1n
CNI Fall Meeting 2017, 12/11/2017, Washington, DC
Creating Topical Collections:
Web Archives vs. Live Web
Martin Klein
@mart1nkle1n
Research Library
Los Alamos National Laboratory
0 1 2 3 4 5
0500100015002000250030003500
0102030405060708090100
All URIs
Relevant URIs
0 1 2 3 4 5
0500100015002000250030003500
0102030405060708090100
All URIs
Relevant URIs

More Related Content

What's hot

Quantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.isQuantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.ismaturban
 
Discovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDDiscovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDMartin Klein
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Persistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DonePersistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DoneHerbert Van de Sompel
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsJustin Brunelle
 
Enabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesEnabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesMichele Weigle
 
Persistent Identifiers and the Web: The Need for an Unambiguous Mapping
Persistent Identifiers and the Web: The Need for an Unambiguous MappingPersistent Identifiers and the Web: The Need for an Unambiguous Mapping
Persistent Identifiers and the Web: The Need for an Unambiguous MappingHerbert Van de Sompel
 
Hiberlink: Investigating Reference Rot, December 2013
Hiberlink: Investigating Reference Rot, December 2013Hiberlink: Investigating Reference Rot, December 2013
Hiberlink: Investigating Reference Rot, December 2013Herbert Van de Sompel
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
A Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web ArchivesA Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web Archivesjcdl2018
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...Justin Brunelle
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerWiLS
 
Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Emily Nimsakont
 
First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...Martin Klein
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 

What's hot (20)

The Web We Want
The Web We WantThe Web We Want
The Web We Want
 
Quantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.isQuantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.is
 
Discovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDDiscovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCID
 
Creating Pockets of Persistence
Creating Pockets of PersistenceCreating Pockets of Persistence
Creating Pockets of Persistence
 
Signposting Overview
Signposting OverviewSignposting Overview
Signposting Overview
 
Reminiscing about interoperability
Reminiscing about interoperabilityReminiscing about interoperability
Reminiscing about interoperability
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Persistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DonePersistent Identification: Easier Said than Done
Persistent Identification: Easier Said than Done
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Enabling Personal Use of Web Archives
Enabling Personal Use of Web ArchivesEnabling Personal Use of Web Archives
Enabling Personal Use of Web Archives
 
Persistent Identifiers and the Web: The Need for an Unambiguous Mapping
Persistent Identifiers and the Web: The Need for an Unambiguous MappingPersistent Identifiers and the Web: The Need for an Unambiguous Mapping
Persistent Identifiers and the Web: The Need for an Unambiguous Mapping
 
Hiberlink: Investigating Reference Rot, December 2013
Hiberlink: Investigating Reference Rot, December 2013Hiberlink: Investigating Reference Rot, December 2013
Hiberlink: Investigating Reference Rot, December 2013
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
A Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web ArchivesA Framework for Aggregating Private and Public Web Archives
A Framework for Aggregating Private and Public Web Archives
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?
 
First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 

Similar to Creating Topical Collections: Web Archives vs. Live Web

Focused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event CollectionsFocused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event CollectionsMartin Klein
 
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...Reference Rot in Scholarly Communication: A Reliable Quantification and a P...
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...Martin Klein
 
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10TechSoup
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
web_archive_interoperability_memento
web_archive_interoperability_mementoweb_archive_interoperability_memento
web_archive_interoperability_mementoMartin Klein
 
Generating collections for stories and events
Generating collections for stories and eventsGenerating collections for stories and events
Generating collections for stories and eventsAlexander Nwala
 
Building Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web ArchivesBuilding Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web ArchivesMartin Klein
 
The Off-Topic Memento Toolkit
The Off-Topic Memento ToolkitThe Off-Topic Memento Toolkit
The Off-Topic Memento ToolkitShawn Jones
 
How to Write an Academic Paper
How to Write an Academic PaperHow to Write an Academic Paper
How to Write an Academic PaperMichele Weigle
 
IBM Connections REST-API Waltz
IBM Connections REST-API WaltzIBM Connections REST-API Waltz
IBM Connections REST-API WaltzHenning Schmidt
 
Using the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly CommunicationUsing the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly CommunicationMartin Klein
 
How to Prepare and Give and Academic Presentation
How to Prepare and Give and Academic PresentationHow to Prepare and Give and Academic Presentation
How to Prepare and Give and Academic PresentationMichele Weigle
 
Resources in uct libraries acc_hons_shellyh_2017
Resources in uct libraries acc_hons_shellyh_2017Resources in uct libraries acc_hons_shellyh_2017
Resources in uct libraries acc_hons_shellyh_2017Susanne Noll
 
Session 1.4 a distributed network of heritage information
Session 1.4   a distributed network of heritage informationSession 1.4   a distributed network of heritage information
Session 1.4 a distributed network of heritage informationsemanticsconference
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamEnno Meijers
 
2015 07-08-wikidocks
2015 07-08-wikidocks2015 07-08-wikidocks
2015 07-08-wikidocksErika Herzog
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC
 
Collaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsCollaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsAnna Perricci
 

Similar to Creating Topical Collections: Web Archives vs. Live Web (20)

Focused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event CollectionsFocused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event Collections
 
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...Reference Rot in Scholarly Communication: A Reliable Quantification and a P...
Reference Rot in Scholarly Communication: A Reliable Quantification and a P...
 
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10
Webinar: SOS Save Our Site! Archiving Web Content-2017-08-10
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
web_archive_interoperability_memento
web_archive_interoperability_mementoweb_archive_interoperability_memento
web_archive_interoperability_memento
 
Generating collections for stories and events
Generating collections for stories and eventsGenerating collections for stories and events
Generating collections for stories and events
 
Building Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web ArchivesBuilding Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web Archives
 
The Off-Topic Memento Toolkit
The Off-Topic Memento ToolkitThe Off-Topic Memento Toolkit
The Off-Topic Memento Toolkit
 
How to Write an Academic Paper
How to Write an Academic PaperHow to Write an Academic Paper
How to Write an Academic Paper
 
IBM Connections REST-API Waltz
IBM Connections REST-API WaltzIBM Connections REST-API Waltz
IBM Connections REST-API Waltz
 
Using the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly CommunicationUsing the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly Communication
 
How to Prepare and Give and Academic Presentation
How to Prepare and Give and Academic PresentationHow to Prepare and Give and Academic Presentation
How to Prepare and Give and Academic Presentation
 
Davis Digital Preservation and the Web: Challenges for Libraries
Davis Digital Preservation and the Web: Challenges for LibrariesDavis Digital Preservation and the Web: Challenges for Libraries
Davis Digital Preservation and the Web: Challenges for Libraries
 
Resources in uct libraries acc_hons_shellyh_2017
Resources in uct libraries acc_hons_shellyh_2017Resources in uct libraries acc_hons_shellyh_2017
Resources in uct libraries acc_hons_shellyh_2017
 
Session 1.4 a distributed network of heritage information
Session 1.4   a distributed network of heritage informationSession 1.4   a distributed network of heritage information
Session 1.4 a distributed network of heritage information
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics Amsterdam
 
2015 07-08-wikidocks
2015 07-08-wikidocks2015 07-08-wikidocks
2015 07-08-wikidocks
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.
 
Collaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive AwardsCollaboration and Cash: Web Archiving Incentive Awards
Collaboration and Cash: Web Archiving Incentive Awards
 
"In the Early Days of a Better Nation": Enhancing the power of metadata today...
"In the Early Days of a Better Nation": Enhancing the power of metadata today..."In the Early Days of a Better Nation": Enhancing the power of metadata today...
"In the Early Days of a Better Nation": Enhancing the power of metadata today...
 

Creating Topical Collections: Web Archives vs. Live Web

  • 1. Creating Topical Collections: Web Archives vs. Live Web. Martin Klein (@mart1nkle1n), Research Library, Los Alamos National Laboratory. CNI Fall Meeting 2017, 12/11/2017, Washington, DC
  • 2. Team Work: Lyudmila Balakireva, Herbert Van de Sompel (@hvdsomp)
  • 3. Background - Live Web
      • Live web is dynamic, lives in a “perpetual now”
      • Subject to link rot and content drift ( = reference rot)*
      • Significant platform/source for news publication/consumption
      • http://archive.is/FhdK6
      • Pew Research Center survey from August 2017:
          • 43% often get news online
          • 50% often get news from TV
          • In early 2016 the figures were 38% (online) and 57% (TV)
      *See: https://doi.org/10.1371/journal.pone.0115253 and https://doi.org/10.1371/journal.pone.0167475
  • 4. Background – Collection Building from the Live Web
      • Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
      • Potentially with guidance from institutional collection policy
      • Results in a list of seeds (URIs, social media accounts, etc.)
      • Utilization of crawling services such as Archive-It, Social Feed Manager
      • Relevance of seeds assessed by humans
      • Time passed since event is a concern because:
          • Stories evolve
          • Reference rot
          • API restrictions
  • 5. Background – Archived Web
      • Web archives are an invaluable resource for researchers, historians, journalists, etc.
      • Often broad in scope, large in scale, covering different temporal intervals
      • This makes discovery, access, and analysis difficult, in particular for topic-specific resources
  • 6. Access to the Archived Web: Memento allows access to many web archives simultaneously. See http://timetravel.mementoweb.org/ and http://mementoweb.org/guide/rfc/
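As a rough illustration of how a client might ask Memento for an archived copy near a given date, the sketch below only builds the request (URL plus `Accept-Datetime` header) and does not send it. The TimeGate base URL of the Time Travel aggregator and the helper name `build_timegate_request` are assumptions made for this example, not something the slides specify.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumption: the Time Travel aggregator exposes a TimeGate under this base URL.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri: str, when: datetime) -> Tuple[str, Dict[str, str]]:
    """Return the TimeGate URL and headers for datetime negotiation (hypothetical helper)."""
    # Accept-Datetime carries an HTTP date expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE_BASE + uri, headers


url, headers = build_timegate_request(
    "http://www.cnn.com/", datetime(2017, 12, 11, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

Sending this request with any HTTP client and following the redirect would, under these assumptions, lead to a memento close to the requested date.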
  • 7. <Intermezzo>
  • 8-11. Web Crawling (four image-only slides)
  • 12-14. Focused Crawling (three slides with the same diagram: a crawl tree with a seed, its children Child 1 to Child 3, and their children; legend: not crawled, crawled and not relevant, crawled and relevant)
  • 15. </Intermezzo>
  • 16. Inspiration from Previous Work: https://doi.org/10.1007/978-3-319-67008-9_10
  • 17. Previous Work - Setup
      • Extract event-centric document collections via focused crawling of an archive
      • Archive = web pages from .de top-level domain, captured by the Internet Archive until 2013 (30 TB, 4 billion captures, 1 billion URIs)
      • Identified 28 topics, likely covered in archive
      • Text of topics’ Wikipedia page used for content relevance evaluation
      • Crawled page datetime used for temporal relevance evaluation
      • Overall relevance = content relevance + temporal relevance
      • Wikipedia page outlinks used as seeds for focused crawl
  • 18. Previous Work – Results (figure, no text)
  • 19. Our Questions
      • Can we create high-quality topical collections by focused crawling online-available web archives?
      • What is the effect of including multiple archives in the crawl?
      • How do collections created from the archived web compare to those created from the live web?
      • How does the amount of time passed since the event affect the quality of the collection?
  • 20. Our Experiment
      • Topics limited to terror attacks and mass shootings in the U.S.
      • From different times in the past
      • Focused crawl of:
          • 22 archives, simultaneously, via Memento infrastructure
          • the live web
      • Take content relevance (CR) and temporal relevance (TR) into account
          • Equally weighted: R = (0.5 x CR) + (0.5 x TR)
      • Use events’ Wikipedia page as input for focused crawler
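The equally weighted score from slide 20 can be written as a one-line function. This is a minimal sketch; the function name `overall_relevance` and the `content_weight` parameter are illustrative choices, and both inputs are assumed to lie between 0 and 1.

```python
def overall_relevance(content_relevance: float,
                      temporal_relevance: float,
                      content_weight: float = 0.5) -> float:
    """Combine content relevance (CR) and temporal relevance (TR) into R.

    With the default weight this is R = 0.5 * CR + 0.5 * TR.
    """
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must be between 0 and 1")
    return (content_weight * content_relevance
            + (1.0 - content_weight) * temporal_relevance)


print(overall_relevance(0.8, 1.0))
```

Keeping the weight as a parameter makes it easy to explore other weightings, although the slides only report the equal split.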
  • 21. Content Relevance
      1. Content of Wikipedia page + random 60% of page’s references
          • Generate topic vector (TF-IDF of 1grams + 2grams)
      2. Content of remaining 40% of Wikipedia page’s outlinks
          • Generate topic vector (TF-IDF of 1grams + 2grams)
      • Compute cosine similarity value between vectors 1 and 2
      • Run 10 times
      • Take average similarity value as content threshold
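The threshold procedure on slide 21 can be sketched in plain Python as follows. Several details are assumptions made for illustration only: the tokenizer, the smoothed IDF computed over just the two aggregated documents, and all function names. The slides do not say how the authors implemented TF-IDF.

```python
import math
import random
import re
from collections import Counter
from typing import Dict, List


def ngrams(text: str) -> List[str]:
    """Return lowercase 1-grams and 2-grams of a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """TF-IDF vectors with a smoothed IDF over the given documents (assumption)."""
    counts = [Counter(ngrams(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({t: (f / total) * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
                        for t, f in c.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def content_threshold(page_text: str, reference_texts: List[str],
                      runs: int = 10, seed: int = 0) -> float:
    """Average cosine similarity over random 60/40 splits of the references."""
    rng = random.Random(seed)
    similarities = []
    for _ in range(runs):
        refs = reference_texts[:]
        rng.shuffle(refs)
        cut = int(round(0.6 * len(refs)))
        doc1 = " ".join([page_text] + refs[:cut])  # page + 60% of references
        doc2 = " ".join(refs[cut:])                # remaining 40%
        v1, v2 = tfidf_vectors([doc1, doc2])
        similarities.append(cosine(v1, v2))
    return sum(similarities) / len(similarities)


# Toy input, invented for the example.
page = "mass shooting at a nightclub in orlando"
references = [
    "orlando police respond to nightclub shooting",
    "victims of the orlando shooting remembered",
    "nightclub shooting investigation continues in orlando",
    "officials comment on the orlando nightclub attack",
    "community mourns after shooting in orlando",
]
threshold = content_threshold(page, references)
print(round(threshold, 3))
```

A crawled page whose cosine similarity to the topic vector falls below this threshold would then be treated as not relevant in terms of content.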
  • 22. Temporal Relevance
      • Define temporal interval for which crawled pages are considered relevant
      • Event date extracted from Wikipedia event page
      • Change point determined from graph of proportional Wikipedia page edits per day
      (Figure: temporal relevance between 0 and 1 over time; marked points: Event Date, Change Point, Today)
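One possible reading of the temporal relevance figure is sketched below. The shape is an assumption, since the slide text does not spell it out: relevance is 0 before the event, 1 between event date and change point, and then falls linearly to 0 at the crawl date ("today"). The function name and the example dates are hypothetical.

```python
from datetime import date


def temporal_relevance(page_date: date, event_date: date,
                       change_point: date, today: date) -> float:
    """Temporal relevance in [0, 1] under the assumed piecewise-linear shape."""
    if page_date < event_date or page_date > today:
        return 0.0  # published before the event or dated in the future
    if page_date <= change_point:
        return 1.0  # inside the interval of full relevance
    # Assumed linear decay from 1 at the change point to 0 at the crawl date.
    decay_days = (today - change_point).days
    return 1.0 - (page_date - change_point).days / decay_days


# Hypothetical dates, chosen only to make the arithmetic easy to follow.
event = date(2017, 1, 1)
change = date(2017, 1, 11)
crawl_day = date(2017, 1, 31)
print(temporal_relevance(date(2017, 1, 21), event, change, crawl_day))
```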
  • 23. Change Point Detection (figure: percentage of Wikipedia page edits, 0 to 100, over edit dates from 2016-06-12 to 2017-08-24; the value 46 is marked)
  • 24. Datetime Extraction
      • Extract datetime from pages via:
          • URI, e.g. http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
          • Meta tags, e.g. <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
          • ODU’s Carbondate tool: http://carbondate.cs.odu.edu/
          • Memento datetime
          • X-Header
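The first two extraction routes on slide 24 (date in the URI path, date in a meta tag) can be sketched with the standard library alone. The regular expression, the choice of meta attributes to inspect, and the helper names are assumptions for this example; real pages use many more conventions.

```python
import re
from datetime import date, datetime
from html.parser import HTMLParser
from typing import Optional

# Matches path segments such as /2017/12/09/ (assumed URI convention).
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_uri(uri: str) -> Optional[date]:
    """Return the date embedded in a URI path, or None if there is none."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2017/13/40/
        return None


class _MetaDateParser(HTMLParser):
    """Collects the first publication datetime found in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Optional[datetime] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.found is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("itemprop") == "datePublished"
                or attributes.get("property") == "article:published"):
            content = attributes.get("content")
            if content:
                try:
                    self.found = datetime.fromisoformat(content)
                except ValueError:
                    pass  # not an ISO 8601 value; ignore it


def date_from_meta(html: str) -> Optional[datetime]:
    parser = _MetaDateParser()
    parser.feed(html)
    return parser.found


print(date_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/"))
print(date_from_meta('<meta property="article:published" itemprop="datePublished" '
                     'content="2017-12-09T10:14:50-05:00" />'))
```

In a crawler these helpers would typically be tried in order, falling back to the other sources listed on the slide when neither yields a date.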
  • 25. Crawls
      • Use version of Wikipedia page that was live at change point
      • Possible crawl stop conditions:
          • Total number of documents crawled
          • Accumulated size of crawled documents
          • Time elapsed since crawl started
          • Crawl x levels deep
          • No more relevant documents left
      • Our pick:
          • Crawl relevant documents
          • 5 levels deep
          • with priority queue
      (Diagram: crawl tree with the seed at Level 0, Child 1 to Child 3 at Level 1, and their children at Level 2)
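The crawl strategy picked on slide 25 (follow only relevant documents, stop at 5 levels, use a priority queue) might look roughly like the sketch below. `fetch_links` and `score` are hypothetical callables supplied by the caller, and prioritizing a link by its parent's relevance is an assumption; the slides do not state how queue priorities were assigned.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def focused_crawl(seeds: Iterable[str],
                  fetch_links: Callable[[str], Iterable[str]],
                  score: Callable[[str], float],
                  threshold: float,
                  max_depth: int = 5) -> List[Tuple[str, float]]:
    """Return (uri, relevance) for every relevant document, in crawl order."""
    seeds = list(seeds)
    queue: List[Tuple[float, int, int, str]] = []
    order = 0  # tie-breaker so equal priorities are served first-in, first-out
    for seed in seeds:
        heapq.heappush(queue, (-1.0, order, 0, seed))
        order += 1
    seen = set(seeds)
    relevant: List[Tuple[str, float]] = []
    while queue:
        _, _, depth, uri = heapq.heappop(queue)
        relevance = score(uri)
        if relevance < threshold:
            continue  # crawled but not relevant: its outlinks are not followed
        relevant.append((uri, relevance))
        if depth >= max_depth:
            continue  # depth limit reached
        for link in fetch_links(uri):
            if link not in seen:
                seen.add(link)
                # heapq is a min-heap, so negate the priority (parent's relevance).
                heapq.heappush(queue, (-relevance, order, depth + 1, link))
                order += 1
    return relevant


# Tiny in-memory "web", invented for the example.
graph = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
scores = {"seed": 0.9, "a": 0.8, "b": 0.2, "a1": 0.7, "b1": 0.9}
result = focused_crawl(["seed"], graph.__getitem__, scores.__getitem__, threshold=0.5)
print([uri for uri, _ in result])
```

In the toy graph, `b1` is never crawled because its parent `b` scores below the threshold, which mirrors the "not crawled" nodes in the focused crawling diagram.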
  • 26. Collections Crawled (in late November)
      • New York City, October 31st 2017
      • Las Vegas, October 1st 2017
      • Orlando, June 12th 2016
      • San Bernardino, December 2nd 2015
      • Tucson, January 8th 2011
      • Binghamton, April 3rd 2009
  • 27. NYC, 10/31/2017 – URIs per Level (figure: two panels, Archived Crawl and Live Crawl; x-axis: Levels 0 to 5; series: All URIs, Relevant URIs)
  • 28. Intermezzo – Focused Crawling (diagram: crawl tree with a seed, its children Child 1 to Child 3, and their children; legend: not crawled, crawled and not relevant, crawled and relevant)
  • 29. NYC, 10/31/2017 – Relevance over URIs (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived, Live)
  • 30. NYC, 10/31/2017 – Relevance over Crawl Time (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds; y-axis: Accumulated Relevance; series: Archived, Live)
  • 31. NYC, 10/31/2017 – Web Archive Distribution (figure: All Mementos and Relevant Mementos per archive: web.archive.org, wayback.archive-it.org, archive.is, perma-archives.org, webarchive.nationalarchives.gov.uk)
  • 32. Binghamton, April 3rd 2009 – URIs per Level (figure: two panels, Archived Crawl with Levels 0 to 5 and Live Crawl with Levels 0 to 4; series: All URIs, Relevant URIs)
  • 33. Binghamton, April 3rd 2009 – Relevance over URIs (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived, Live)
  • 34. Binghamton, April 3rd 2009 – Relevance over Crawl Time (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds; y-axis: Accumulated Relevance; series: Archived, Live)
  • 35. Binghamton, April 3rd 2009 – Web Archive Distribution (figure: All Mementos and Relevant Mementos per archive: web.archive.org, webarchive.loc.gov, wayback.archive-it.org, arquivo.pt, swap.stanford.edu, archive.is, web.archive.bibalex.org:80)
  • 36. San Bernardino, December 2nd 2015 – URIs per Level (figure: two panels, Archived Crawl and Live Crawl; x-axis: Levels 0 to 5; series: All URIs, Relevant URIs)
  • 37. San Bernardino, December 2nd 2015 – Relevance over URIs (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Documents; y-axis: Accumulated Relevance; series: Archived, Live)
  • 38. San Bernardino, December 2nd 2015 – Relevance over Crawl Time (figure: two panels, Relevant Documents and All Crawled Documents; x-axis: Time in Seconds; y-axis: Accumulated Relevance; series: Archived, Live)
  • 39. San Bernardino, December 2nd 2015 – Web Archive Distribution (figure: All Mementos and Relevant Mementos per archive: web.archive.org, wayback.archive-it.org, webarchive.loc.gov, archive.is, arquivo.pt, wayback.vefsafn.is, collection.europarchive.org, perma-archives.org, digital.library.yorku.ca, webarchive.org.uk)
  • 40. Take-Aways
      • Web archives are good resources to build topical collections of web resources
      • Utilizing multiple web archives is beneficial for the collection
      • Crawling web archives is much slower than crawling the live web
      • Collections about very recent events benefit more from the live web than from the archived web, but
      • collections about events from the distant past benefit more from archives than from the live web, and
      • collections about less recent events can (still) benefit from the live web and (already) from the archived web
  • 41. Where to go next
      • Forgive one level of “irrelevance”
      • Compare with manually curated collections (from AIT)
      • Diversify to international topics and beyond shootings
      • Investigate questions of optimal start and end time of crawls
  • 42. https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384
  • 43. Creating Topical Collections: Web Archives vs. Live Web. Martin Klein (@mart1nkle1n), Research Library, Los Alamos National Laboratory