SlideShare a Scribd company logo
EVALUATING THE
TEMPORAL COHERENCE OF
ARCHIVED PAGES
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
IIPC 2015
APRIL 27–MAY 1, 2015
STANFORD UNIVERSITY
HERBERT VAN DE SOMPEL
LOS ALAMOS NATIONAL
LABORATORY
HE WENT TO VIEW AN
ARCHIVED PAGE. YOU
WON’T BELIEVE WHAT HE
SAW NEXT…
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
IIPC 2015
APRIL 27–MAY 1, 2015
STANFORD UNIVERSITY
HERBERT VAN DE SOMPEL
LOS ALAMOS NATIONAL
LABORATORY
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Composite Mementos
 Coherence Framework
 Temporal Coherence
 Future Work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
3
2015IIPCGeneralAssembly
RESEARCH TO DATE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
4
• How to crawl a site to maximize coherence
• Ben Saad et al., JCDL 2011, TPDL 2011
• Detecting, visualizing temporal defects
• Spaniol et al., WICOW 2009, IWAW 2009, VLDB
2009
2015IIPCGeneralAssembly
RESEARCH TO DATE: @WEBSCIDL
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
5
• How much of the web is archived?
• Ainsworth et al., JCDL 2011
• Are the archives stable?
• Brunelle et al., JCDL 2013
• Temporal drift while browsing in an archive?
• Ainsworth et al., JCDL 2013
• Are the missing resources important?
• Brunelle et al., JCDL 2014
• Are the present resources correct?
2015IIPCGeneralAssembly
AS PRESENTED BY IA
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
6
http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
2015IIPCGeneralAssembly
NOT ALL 2004-12-09T19:09:26
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
7
2015IIPCGeneralAssembly
CLEAR OR CLOUDY?
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
8
2015IIPCGeneralAssembly
QUESTIONS
• How prevalent is temporal incoherence?
• Can temporal coherence be improved by using
multiple archives?
• Can temporal coherence be improved by
introducing memento selection heuristics?
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
9
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Composite Memento
 Coherence Framework
 Temporal Coherence
 Future Work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
10
2015IIPCGeneralAssembly
COMPOSITE MEMENTO
PRESENTATION STRUCTURE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
11
URI-M0
URI-M1 URI-M2 URI-Mi-1
...
URI-Mi URI-Mi+1 URI-Mn...
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Composite Memento
 Coherence Framework
 Temporal Coherence
 Future Work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
12
2015IIPCGeneralAssembly
COHERENCE STATES
• Prima Facie Coherent
Evidence that the memento existed in its archived
state when the root was acquired.
• Prima Facie Violative
Evidence … did not exist ...
• Possibly Coherent
Evidence … might have existed ...
• Probably Violative
Evidence … probably did not exist ...
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
13
2015IIPCGeneralAssembly
CONSIDER THIS PAGE…
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
14
<html>
<img src="foo.jpeg">
</html>
2015IIPCGeneralAssembly
WITH THESE RESPONSE HEADERS
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
15
HTTP/1.1 200 OK
Server: Tengine/2.0.3
Date: Mon, 27 Apr 2015 22:03:32 GMT
Content-Type: image/jpeg
Content-Length: 15632
Connection: keep-alive
Memento-Datetime: Tue, 07 Feb 2006 00:58:23 GMT
Link: <Memento links deleted...>
X-Archive-Orig-server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4
X-Archive-Orig-etag: "4978-3d10-3e4d822e"
X-Archive-Orig-content-length: 15632
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-date: Tue, 07 Feb 2006 00:58:20 GMT
X-Archive-Orig-content-type: image/jpeg
X-Archive-Orig-last-modified: Fri, 14 Feb 2003 23:56:30 GMT
X-Archive-Orig-connection: close
<other headers deleted>
2015IIPCGeneralAssembly
PRIMA FACIE COHERENT
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
16
Bracket Pattern:
Memento-Datetime + Last-Modified
(yes, Last-Modified is sometimes wrong, but many of those cases can be detected)
2015IIPCGeneralAssembly
PRIMA FACIE COHERENT
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
17
Equal Pattern: simultaneous capture
(with an optionally tunable “bubble of simultaneity”)
2015IIPCGeneralAssembly
PRIMA FACIE VIOLATIVE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
18
Closest memento created and acquired after
the root was acquired
2015IIPCGeneralAssembly
POSSIBLY COHERENT
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
19
Closest (or only) memento captured before the
root
2015IIPCGeneralAssembly
PROBABLY VIOLATIVE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
20
Closest (or only) memento captured after the
root but no Last-Modified (possibly indicating a
dynamically generated representations)
(for both PC & PV, you could do content comparison if there are 2
mementos that straddle the root page)
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Composite Memento
 Coherence Framework
 Temporal Coherence
 Future Work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
21
2015IIPCGeneralAssembly
TEMPORAL COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
22
2015IIPCGeneralAssembly
TEMPORAL COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
23
2005-05-
14
01:36:08
+9 days
+18 days +18 days
+7 months
+2.1 years
2015IIPCGeneralAssembly
EMBEDDED RESOURCES
Resource Memento-Datetime Delta Resource
Memento-
Datetime
Delta
http://www.cs.odu.edu 2005-05-14 01:36:08 spacer.gif 2005-06-01 16:23:10 18.6 d
mm_menu.js 2005-05-23 02:39:12 9.0 d jimcheng.gif 2005-06-01 16:37:39 18.6 d
style.css 2005-05-23 02:39:39 9.0 d jsmith.gif 2005-06-01 16:58:50 18.6 d
gfx-logo-odu-crown.gif 2005-05-23 02:39:39 9.0 d rmenu_1st_featured_alumni.png 2005-06-01 21:21:45 18.8 d
ddmenu_ddown.js 2005-05-23 02:39:43 9.0 d hmenu_college_...-new.png 2005-12-21 20:14:25 7.3 mo
university.js 2005-05-23 02:39:56 9.0 d rmenu_1st_upcoming_news.png 2005-12-21 20:15:14 7.3 mo
rmenu_1st_about.png 2005-06-01 13:40:25 18.5 d rmenu_1st_upcoming_events.png 2005-12-21 21:01:12 7.3 mo
rmenu_bottom_229.gif 2005-06-01 14:07:29 18.5 d lmenu_1st_resources.png 2005-12-28 17:47:41 7.5 mo
shadow-bl.gif 2005-06-01 14:55:53 18.6 d bullet_blue_triangle.gif 2005-12-28 19:43:48 7.5 mo
ecsbdg.jpg 2005-06-01 14:56:17 18.6 d logo-cs.gif 2005-12-28 19:54:29 7.5 mo
shadow-br.gif 2005-06-01 15:18:18 18.6 d rmenu_1st_featured_student.png 2007-06-12 02:36:07 2.1 years
gfx-btn-go-dblue.gif 2005-06-01 15:34:19 18.6 d shadow-b.gif 2007-06-21 02:35:17 2.1 years
shadow-tr.gif 2005-06-01 15:55:57 18.6 d shadow-r.gif 404 Not Found
header-right1.gif 2005-06-01 16:06:16 18.6 d
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
24
Embedded Resources 26
Mean Delta 125.9 days
Standard Deviation 207.7 days
Minimum Delta 9.0 days
Maximum Delta 2.1 years
2015IIPCGeneralAssembly
REPRESENTING COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
25
2015IIPCGeneralAssembly
REPRESENTING COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
26
2015IIPCGeneralAssembly
REPRESENTING COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
27
2015IIPCGeneralAssembly
REPRESENTING COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
28
2015IIPCGeneralAssembly
REPRESENTING COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
29
2015IIPCGeneralAssembly
THE FULL CHART
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
30
2005-03-10
2015IIPCGeneralAssembly
EXPERIMENT: DATA SET
• 4,000 sample URI-Rs (data set from JCDL 2011)
• Single and Multiple Archives
• Two Heuristics:
• Minimum distance (current default Wayback behavior)
• choose closest Memento-Datetime
• Bracket (proposed here)
• use combination of Memento-Datetime + Last-Modified
• Download all TimeMaps
• Download all root mementos
• Download all embedded resources
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
31
2015IIPCGeneralAssembly
EXPERIMENT: SAMPLING
• For each root URI-R TimeMap, choose a single
memento per month
• Extract embedded URI-Rs
• Download TimeMaps for embedded URI-Rs
• Download heuristically best URI-Ms
• Repeat recursively
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
32
2015IIPCGeneralAssembly
ROOT URI-R STATISTICS
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
33
Root URI-Rs archived 2,756 • 68.9%
In multiple archives 1,180 • 29.5%
Mean archives per URI-R 1.58
Mean mementos per URI-R 124.57
200 OK 82,425 • 93.6%
503 Service Unavailable 4,444 • 5.0%
404 Not found 583 • 0.7%
403 Forbidden 388 • 0.4%
Others 214 • 0.3%
URI-M Status
Archival Data
2015IIPCGeneralAssembly
EMBEDDED URI-R STATISTICS
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
34
Embedded URI-Rs 1,623,127
per root URI-M 19.7
Embedded URI-Ms available 1,332,993 • 93.6%
per root URI-M 15.1
Not archived 312,641 • 83.9%
404 Not found 44,852 • 12.0%
403 Forbidden 6,116 • 1.6%
503 Service Unavailable 5,442 • 1.5%
Others 3,508 • 0.9%
URI-M Failure Reasons
Archival Data
2015IIPCGeneralAssembly
COMPOSITE MEMENTO (ROOT)
COMPLETENESS & COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
35
Description
MinDist
Single
MinDist
Multi
Bracket
Single
Bracket
Multi
Mean Complete 76.1% 80.2% 76.2% 80.3%
Mean Missing 23.9% 19.8% 23.8% 19.7%
Completeness (and Missing)
Description
MinDist
Single
MinDist
Multi
Bracket
Single
Bracket
Multi
Mean Prima Facie Coherent 41.0% 40.9% 54.7% 54.6%
Mean Possibly Coherent 27.3% 28.7% 12.8% 14.2%
Mean Probably Violative 2.5% 5.3% 2.5% 5.3%
Mean Prima Facie Violative 5.3% 5.3% 6.2% 6.2%
Coherence
At least 5% of pages can be shown to have temporal violations!
Multiple archives: +completeness, -coherence?
2015IIPCGeneralAssembly
EMBEDDED MEMENTO COHERENCE
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
36
Description
MinDist
Single
MinDist
Multi
Bracket
Single
Bracket
Multi
Prima Facie Coherent 622,565 621,447 864,736 859,625
Possibly Coherent 497,405 466,046 244,104 215,585
Probably Violative 104,376 53,734 104,339 53,694
Prima Facie Violative 100,760 103,662 114,062 117,469
Totals 1,325,106 1,244,889 1,327,241 1,246,373
Description
MinDist
Single
MinDist
Multi
Bracket
Single
Bracket
Multi
Prima Facie Coherent 47.0% 49.9% 65.2% 69.0%
Possibly Coherent 37.5% 37.4% 18.4% 17.3%
Probably Violative 7.9% 4.3% 7.9% 4.3%
Prima Facie Violative 7.6% 8.3% 8.6% 9.4%
At least 7% of embedded resources are used violatively!
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Related work
 Preliminary work
 Temporal Coherence
 Future work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
37
2015IIPCGeneralAssembly
MINOR OR MAJOR VIOLATIONS?
• This is a temporal violation. But is it
meaningful?
• How to judge?
• Most archives transform HTML
• Not all archives support export of original file
• How to measure similarity on binary files?
• early results: very few cases of equivalent binaries
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
38
2015IIPCGeneralAssembly
HOW TO CONVEY COHERENCE?
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
39
How to scale to > 100
embedded mementos?
How to convey coherence
& contributing archive?
2015IIPCGeneralAssembly
POLICIES & HEURISTICS
• Tradeoffs:
• Fast: minimize distance
• Accurate: maximize coherence
• Complete: query all (not just top k) archives in
order to maximize completeness
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
40
2015IIPCGeneralAssembly
CONTENTS
 Motivation
 Composite Memento
 Coherence Framework
 Temporal Coherence
 Future Work
 Conclusion
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
41
2015IIPCGeneralAssembly
CONCLUSION
• Defined four classes of temporal coherence for describing
relationship between root & embedded mementos
• Prima Facie {Coherent|Violative}
• Possibly Coherent / Probably Violative
• Determine classes using a combination of HTTP metadata,
primarily Memento-Datetime & Last-Modified
• At least 5% of IA pages have 1 or more temporal violations
• Using multiple archives increases completeness, but with a
possible loss of coherence
• Determining semantic impact of violations and UI issues
(status, policy choices) are areas of future research
4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel
42

More Related Content

Viewers also liked

The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
Michael Nelson
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
Yasmin AlNoamany, PhD
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
Michael Nelson
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Michael Nelson
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
Sawood Alam
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
Michael Nelson
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
Michael Nelson
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
Michael Nelson
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Michael Nelson
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
Michael Nelson
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
Michael Nelson
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Michael Nelson
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
Michael Nelson
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
Michael Nelson
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
Michael Nelson
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Michael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
Michael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
Michael Nelson
 

Viewers also liked (18)

The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 

Similar to Evaluating the Temporal Coherence of Archived Pages

Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
ScottAinsworth
 
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
Philipp Singer
 
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
ScottAinsworth
 
MOA 2015, Keynote - Open All The Things
MOA 2015, Keynote - Open All The ThingsMOA 2015, Keynote - Open All The Things
MOA 2015, Keynote - Open All The Things
Kungliga biblioteket National Library of Sweden
 
How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011
Ahmed AlSum
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
Heiko Paulheim
 

Similar to Evaluating the Temporal Coherence of Archived Pages (6)

Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in ...
 
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on...
 
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing...
 
MOA 2015, Keynote - Open All The Things
MOA 2015, Keynote - Open All The ThingsMOA 2015, Keynote - Open All The Things
MOA 2015, Keynote - Open All The Things
 
How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
 

More from Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
Michael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
Michael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
Michael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Michael Nelson
 

More from Michael Nelson (9)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Evaluating the Temporal Coherence of Archived Pages

  • 1. EVALUATING THE TEMPORAL COHERENCE OF ARCHIVED PAGES SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY IIPC 2015 APRIL 27–MAY 1, 2015 STANFORD UNIVERSITY HERBERT VAN DE SOMPEL LOS ALAMOS NATIONAL LABORATORY
  • 2. HE WENT TO VIEW AN ARCHIVED PAGE. YOU WON’T BELIEVE WHAT HE SAW NEXT… SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY IIPC 2015 APRIL 27–MAY 1, 2015 STANFORD UNIVERSITY HERBERT VAN DE SOMPEL LOS ALAMOS NATIONAL LABORATORY
  • 3. 2015IIPCGeneralAssembly CONTENTS  Motivation  Composite Mementos  Coherence Framework  Temporal Coherence  Future Work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 3
  • 4. 2015IIPCGeneralAssembly RESEARCH TO DATE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 4 • How to crawl a site to maximize coherence • Ben Saad et al., JCDL 2011, TPDL 2011 • Detecting, visualizing temporal defects • Spaniol et al., WICOW 2009, IWAW 2009, VLDB 2009
  • 5. 2015IIPCGeneralAssembly RESEARCH TO DATE: @WEBSCIDL 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 5 • How much of the web is archived? • Ainsworth et al., JCDL 2011 • Are the archives stable? • Brunelle et al., JCDL 2013 • Temporal drift while browsing in an archive? • Ainsworth et al., JCDL 2013 • Are the missing resources important? • Brunelle et al., JCDL 2014 • Are the present resources correct?
  • 6. 2015IIPCGeneralAssembly AS PRESENTED BY IA 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 6 http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
  • 7. 2015IIPCGeneralAssembly NOT ALL 2004-12-09T19:09:26 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 7
  • 8. 2015IIPCGeneralAssembly CLEAR OR CLOUDY? 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 8
  • 9. 2015IIPCGeneralAssembly QUESTIONS • How prevalent is temporal incoherence? • Can temporal coherence be improved by using multiple archives? • Can temporal coherence be improved by introducing memento selection heuristics? 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 9
  • 10. 2015IIPCGeneralAssembly CONTENTS  Motivation  Composite Memento  Coherence Framework  Temporal Coherence  Future Work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 10
  • 11. 2015IIPCGeneralAssembly COMPOSITE MEMENTO PRESENTATION STRUCTURE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 11 URI-M0 URI-M1 URI-M2 URI-Mi-1 ... URI-Mi URI-Mi+1 URI-Mn...
  • 12. 2015IIPCGeneralAssembly CONTENTS  Motivation  Composite Memento  Coherence Framework  Temporal Coherence  Future Work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 12
  • 13. 2015IIPCGeneralAssembly COHERENCE STATES • Prima Facie Coherent Evidence that the memento existed in its archived state when the root was acquired. • Prima Facie Violative Evidence … did not exist ... • Possibly Coherent Evidence … might have existed ... • Probably Violative Evidence … probably did not exist ... 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 13
  • 14. 2015IIPCGeneralAssembly CONSIDER THIS PAGE… 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 14 <html> <img src="foo.jpeg"> </html>
  • 15. 2015IIPCGeneralAssembly WITH THESE RESPONSE HEADERS 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 15 HTTP/1.1 200 OK Server: Tengine/2.0.3 Date: Mon, 27 Apr 2015 22:03:32 GMT Content-Type: image/jpeg Content-Length: 15632 Connection: keep-alive Memento-Datetime: Tue, 07 Feb 2006 00:58:23 GMT Link: <Memento links deleted...> X-Archive-Orig-server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 X-Archive-Orig-etag: "4978-3d10-3e4d822e" X-Archive-Orig-content-length: 15632 X-Archive-Orig-accept-ranges: bytes X-Archive-Orig-date: Tue, 07 Feb 2006 00:58:20 GMT X-Archive-Orig-content-type: image/jpeg X-Archive-Orig-last-modified: Fri, 14 Feb 2003 23:56:30 GMT X-Archive-Orig-connection: close <other headers deleted>
  • 16. 2015IIPCGeneralAssembly PRIMA FACIE COHERENT 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 16 Bracket Pattern: Memento-Datetime + Last-Modified (yes, Last-Modified is sometimes wrong, but many of those cases can be detected)
  • 17. 2015IIPCGeneralAssembly PRIMA FACIE COHERENT 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 17 Equal Pattern: simultaneous capture (with an optionally tunable “bubble of simultaneity”)
  • 18. 2015IIPCGeneralAssembly PRIMA FACIE VIOLATIVE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 18 Closest memento created and acquired after the root was acquired
  • 19. 2015IIPCGeneralAssembly POSSIBLY COHERENT 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 19 Closest (or only) memento captured before the root
  • 20. 2015IIPCGeneralAssembly PROBABLY VIOLATIVE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 20 Closest (or only) memento captured after the root but no Last-Modified (possibly indicating a dynamically generated representations) (for both PC & PV, you could do content comparison if there are 2 mementos that straddle the root page)
  • 21. 2015IIPCGeneralAssembly CONTENTS  Motivation  Composite Memento  Coherence Framework  Temporal Coherence  Future Work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 21
  • 22. 2015IIPCGeneralAssembly TEMPORAL COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 22
  • 23. 2015IIPCGeneralAssembly TEMPORAL COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 23 2005-05- 14 01:36:08 +9 days +18 days +18 days +7 months +2.1 years
  • 24. 2015IIPCGeneralAssembly EMBEDDED RESOURCES Resource Memento-Datetime Delta Resource Memento- Datetime Delta http://www.cs.odu.edu 2005-05-14 01:36:08 spacer.gif 2005-06-01 16:23:10 18.6 d mm_menu.js 2005-05-23 02:39:12 9.0 d jimcheng.gif 2005-06-01 16:37:39 18.6 d style.css 2005-05-23 02:39:39 9.0 d jsmith.gif 2005-06-01 16:58:50 18.6 d gfx-logo-odu-crown.gif 2005-05-23 02:39:39 9.0 d rmenu_1st_featured_alumni.png 2005-06-01 21:21:45 18.8 d ddmenu_ddown.js 2005-05-23 02:39:43 9.0 d hmenu_college_...-new.png 2005-12-21 20:14:25 7.3 mo university.js 2005-05-23 02:39:56 9.0 d rmenu_1st_upcoming_news.png 2005-12-21 20:15:14 7.3 mo rmenu_1st_about.png 2005-06-01 13:40:25 18.5 d rmenu_1st_upcoming_events.png 2005-12-21 21:01:12 7.3 mo rmenu_bottom_229.gif 2005-06-01 14:07:29 18.5 d lmenu_1st_resources.png 2005-12-28 17:47:41 7.5 mo shadow-bl.gif 2005-06-01 14:55:53 18.6 d bullet_blue_triangle.gif 2005-12-28 19:43:48 7.5 mo ecsbdg.jpg 2005-06-01 14:56:17 18.6 d logo-cs.gif 2005-12-28 19:54:29 7.5 mo shadow-br.gif 2005-06-01 15:18:18 18.6 d rmenu_1st_featured_student.png 2007-06-12 02:36:07 2.1 years gfx-btn-go-dblue.gif 2005-06-01 15:34:19 18.6 d shadow-b.gif 2007-06-21 02:35:17 2.1 years shadow-tr.gif 2005-06-01 15:55:57 18.6 d shadow-r.gif 404 Not Found header-right1.gif 2005-06-01 16:06:16 18.6 d 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 24 Embedded Resources 26 Mean Delta 125.9 days Standard Deviation 207.7 days Minimum Delta 9.0 days Maximum Delta 2.1 years
  • 25. 2015IIPCGeneralAssembly REPRESENTING COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 25
  • 26. 2015IIPCGeneralAssembly REPRESENTING COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 26
  • 27. 2015IIPCGeneralAssembly REPRESENTING COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 27
  • 28. 2015IIPCGeneralAssembly REPRESENTING COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 28
  • 29. 2015IIPCGeneralAssembly REPRESENTING COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 29
  • 30. 2015IIPCGeneralAssembly THE FULL CHART 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 30 2005-03-10
  • 31. 2015IIPCGeneralAssembly EXPERIMENT: DATA SET • 4,000 sample URI-Rs (data set from JCDL 2011) • Single and Multiple Archives • Two Heuristics: • Minimum distance (current default Wayback behavior) • choose closest Memento-Datetime • Bracket (proposed here) • use combination of Memento-Datetime + Last-Modified • Download all TimeMaps • Download all root mementos • Download all embedded resources 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 31
  • 32. 2015IIPCGeneralAssembly EXPERIMENT: SAMPLING • For each root URI-R TimeMap, choose a single memento per month • Extract embedded URI-Rs • Download TimeMaps for embedded URI-Rs • Download heuristically best URI-Ms • Repeat recursively 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 32
  • 33. 2015IIPCGeneralAssembly ROOT URI-R STATISTICS 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 33 Root URI-Rs archived 2,756 • 68.9% In multiple archives 1,180 • 29.5% Mean archives per URI-R 1.58 Mean mementos per URI-R 124.57 200 OK 82,425 • 93.6% 503 Service Unavailable 4,444 • 5.0% 404 Not found 583 • 0.7% 403 Forbidden 388 • 0.4% Others 214 • 0.3% URI-M Status Archival Data
  • 34. 2015IIPCGeneralAssembly EMBEDDED URI-R STATISTICS 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 34 Embedded URI-Rs 1,623,127 per root URI-M 19.7 Embedded URI-Ms available 1,332,993 • 93.6% per root URI-M 15.1 Not archived 312,641 • 83.9% 404 Not found 44,852 • 12.0% 403 Forbidden 6,116 • 1.6% 503 Service Unavailable 5,442 • 1.5% Others 3,508 • 0.9% URI-M Failure Reasons Archival Data
  • 35. 2015IIPCGeneralAssembly COMPOSITE MEMENTO (ROOT) COMPLETENESS & COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 35 Description MinDist Single MinDist Multi Bracket Single Bracket Multi Mean Complete 76.1% 80.2% 76.2% 80.3% Mean Missing 23.9% 19.8% 23.8% 19.7% Completeness (and Missing) Description MinDist Single MinDist Multi Bracket Single Bracket Multi Mean Prima Facie Coherent 41.0% 40.9% 54.7% 54.6% Mean Possibly Coherent 27.3% 28.7% 12.8% 14.2% Mean Probably Violative 2.5% 5.3% 2.5% 5.3% Mean Prima Facie Violative 5.3% 5.3% 6.2% 6.2% Coherence At least 5% of pages can be shown to have temporal violations! Multiple archives: +completeness, -coherence?
  • 36. 2015IIPCGeneralAssembly EMBEDDED MEMENTO COHERENCE 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 36 Description MinDist Single MinDist Multi Bracket Single Bracket Multi Prima Facie Coherent 622,565 621,447 864,736 859,625 Possibly Coherent 497,405 466,046 244,104 215,585 Probably Violative 104,376 53,734 104,339 53,694 Prima Facie Violative 100,760 103,662 114,062 117,469 Totals 1,325,106 1,244,889 1,327,241 1,246,373 Description MinDist Single MinDist Multi Bracket Single Bracket Multi Prima Facie Coherent 47.0% 49.9% 65.2% 69.0% Possibly Coherent 37.5% 37.4% 18.4% 17.3% Probably Violative 7.9% 4.3% 7.9% 4.3% Prima Facie Violative 7.6% 8.3% 8.6% 9.4% At least 7% of embedded resources are used violatively!
  • 37. 2015IIPCGeneralAssembly CONTENTS  Motivation  Related work  Preliminary work  Temporal Coherence  Future work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 37
  • 38. 2015IIPCGeneralAssembly MINOR OR MAJOR VIOLATIONS? • This is a temporal violation. But is it meaningful? • How to judge? • Most archives transform HTML • Not all archives support export of original file • How to measure similarity on binary files? • early results: very few cases of equivalent binaries 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 38
  • 39. 2015IIPCGeneralAssembly HOW TO CONVEY COHERENCE? 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 39 How to scale to > 100 embedded mementos? How to convey coherence & contributing archive?
  • 40. 2015IIPCGeneralAssembly POLICIES & HEURISTICS • Tradeoffs: • Fast: minimize distance • Accurate: maximize coherence • Complete: query all (not just top k) archives in order to maximize completeness 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 40
  • 41. 2015IIPCGeneralAssembly CONTENTS  Motivation  Composite Memento  Coherence Framework  Temporal Coherence  Future Work  Conclusion 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 41
  • 42. 2015IIPCGeneralAssembly CONCLUSION • Defined four classes of temporal coherence for describing relationship between root & embedded mementos • Prima Facie {Coherent|Violative} • Possibly Coherent / Probably Violative • Determine classes using a combination of HTTP metadata, primarily Memento-Datetime & Last-Modified • At least 5% of IA pages have 1 or more temporal violations • Using multiple archives increases completeness, but with a possible loss of coherence • Determining semantic impact of violations and UI issues (status, policy choices) are areas of future research 4/27/15 Scott G. Ainsworth • Michael L. Nelson • Herbert Van de Sompel 42

Editor's Notes

  1. Please forgive the long title. Let me explain it with a fable…
  2. Please forgive the long title. Let me explain it with a fable…
  3. A composite memento as presented by the Internet Archive user interface.
  4. IA presents a single datetime. The datetime of the root memento. What cannot be seen is the datetimes of the embedded mementos. A few of the differences are shown. But are these coherent? Violative? Something in between?
  5. Look at this. The radar image was acquired 9 months after the root memento. Probably not the best recomposition of history!
  6. This leads to questions:
  7. The rest of this presentation will take the following form: A brief, informal description of composite mementos. An introduction of the framework we developed to help understand and evaluate temporal coherence. A review of the results. Closing comments.
  8. We call the collection of all mementos required to display a web page, a composite memento. A composite memento consists of a root and embedded mementos and can be represented as a tree. (It is actually a graph, but can be represented as a tree without loss of generality.) (CLICK) Which is represented as URI-M0 at the top of the tree on the right. Embedded mementos, such as images, are also represented in the tree. Embedded mementos can themselves have embedded mementos, for example HTML in a frame. (The ODU CS home page had frames in its 1990s versions, but no longer does.)
  9. In this section, we will introduce our temporal coherence framework: Define four coherence states Provide examples of patterns for each
  10. Consider 3 mementos, 1 root (black) and 2 embedded (gray and green). Explain time line. Explain T = Memento-Datetime Explain L = Last-Modified. The green embedded memento was acquired after the root. However, it’s Last-Modified is before the root acquisition datetime. Prima Facie Coherent
  11. A simpler case of prima facie coherence Simultaneous acquisition
  12. In this case, the red embedded memento was Acquired after the root and Created after the root Prima facie violative
  13. Consider 2 mementos, 1 root and 1 embedded. (EXPLAIN why there is only one) In this case the embedded memento was captured after the root (POINT OUT WHICH IS WHICH) Is this coherent? -- Hard to tell
  14. Consider 2 mementos, 1 root and 1 embedded. (EXPLAIN why there is only one) In this case the embedded memento was captured after the root (POINT OUT WHICH IS WHICH) Is this coherent? -- Hard to tell
  15. Let us return to temporal coherence. Most web pages are composed from multiple resources, some of which are circled here. (WAIT FOR ANIMATION)
  16. Even though the display is May 14, 2005 (CLICK) The resources are captured at very different times. (CLICK) Some days (CLICK) Some months (CLICK) Even years (in this case an image in the footer)
  17. This is a list of all the mementos that comprise http://www.cs.odu.edu. It is a bit of an eye chart, so here is a summary (CLICK) There are 26 embedded mements (27 total including the root) The mean delta (distance from root) is 125.9 days. The standard deviation is 207.7 – which does not bode well for the mean. Here’s a kicker – the spread is 2.1 years!
  18. We’ll start with a subset of three composite mementos for rpgdreamers.com. This chart shows the roots. Each is plotted at their Memento-Datetime on the vertical axis. On the horizontal axis, they all align at zero. The horizontal axis represtents the delta between an embedded memento and corresponding root. Since these are all roots, they all have zero delta.
  19. Next we add the prima facie coherent embedded mementos. These were all captured after the root but were created before the root (Last-Modified before the root) Note: Only the first Last-Modified is shown to avoid clutter.
  20. Next we add the possibly coherent embedded mementos. These were all captured before the root (and Last-Modified does not matter). Because these were captured before the root, it is likely that they continued to exist when the root was acquired.
  21. Next we add the probably violative embedded mementos. These were all captured afterthe root but have no Last-Modified. We cannot prove these are prima facie violative And they might be prima facie coherent. But since there is no evidence they existed when the root was acquired, we consider them probably violative.
  22. Finally, we add the prima facie violative embedded mementos. These were all captured after the root and have Last-Modified that is also after the root.
  23. Here is the full chart for rpgdreamers.com Note the 2005-03-10 composite memento (arrow and dashed line). Of particular note is the media.gif resource Which has only a single memento Which was captured nearly seven years after the root!
  24. This slide characterizes composite mementos (The next slide characterizes embedded mementos.) On average, each composite memento is 76.1% - 80.3% complete (i.e., 76.1% - 80.3% of eURI-Rs have an available eURI-M). (Mean missing is the additive inverse) The average composite memento has: 40.9% - 54.7% of its eURI-Ms are prima facie coherent 12.8% - 27.3% of its eURI-Ms are possibly coherent 2.5% - 5.3% of its eURI-Ms are probably violative 5.3% - 6.2% of its eURI-Ms are prima facie violative
  25. Statistics on eURI-Ms considered individually. This is different than the previous slide, which considered eURI-Ms in the context of their composite mementos. (Note that the percentages on the previous slide cannot be computed from these counts!) Clearly, Bracket improves PFC selection – by 13.7%. However, PC selection is decreased by 14.5%. And, the selection of PFV also increased! --- Why the apparent reduction in coherence? --- Bracket heuristic makes two changes when Last-Modified is available: The bracket favors PFC over PC increasing the selection of PFC mementos. Minimum distance is computed using either Memento-Datetime or Last-Modified, whichever gives the best result. Thus, when a bracket is not available, the introduction of Last-Modified increases the likelihood of PFV selection.
  26. The rest of this presentation will take the following form: A brief discussion of related work and how this research improves our knowledge. Describe how we measured drift? A review of the results. A quick look at how this work can be refined.
  27. Delta is the absolute value of the difference between the room and embedded Memento-Datetimes. But there are other conditions that could or should indicate a delta of 0 instead. These all revolve around determining that no change has occurred. One of these is bracketing mementos. Explain the chart… However, HTML is problematic because comments are added by the archives. So, we cant check for equality. What similarity measure or measures are reasonable substitutes for equality.
  28. (changed –mln)
  29. Finally, policies and heuristics must be developed. For example, in the drift work we used sliding and sticky
  30. The rest of this presentation will take the following form: A brief, informal description of composite mementos. An introduction of the framework we developed to help understand and evaluate temporal coherence. A review of the results. Closing comments.