Digital Preservation Research
at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
(And hopefully MITRE, soon)
Why are we listening?
• Overview of the problem
• BRIEF introduction to ODU WSDL group
research
• Memento
• I’ll be skipping around, so don’t hesitate to
interrupt me
Digital Preservation
• Using the past Web
– Focus of our research
• Temporal Browsing
– Sessions in the past
• Recovering Lost Pages
– Is it really gone?
• 404s
– How to fix broken links?
Change on the Web
1. Same URI maps to the same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)
2. Same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)
3. Different URI maps to the same or very similar content at the same or at a later time (time A: U1 → C1; time B: U2 → C1, U1 → 404)
4. The content cannot be found at any URI (time A: U1 → C1; time B: U1 → ??)
Time to Talk About Saving
Everything?
Dinner for one or two costs more than a 1TB disk
Wikis have popularized versioning
Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:
http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
Also related projects with cool URI / permalink focus:
http://www.citability.org/
http://data.gov/
http://data.gov.uk/
Fortress Model
• Get a lot of money
• Buy lots of storage
• Hire lots of people
• “Look upon my archive ye Mighty, and
despair!”
Alternate Methods
• Lazy Preservation (McCown)
– “How much preservation do I get if I do absolutely
nothing?”
• Just-In-Time Preservation (Klein)
– Wait for it to disappear, then find a “good ‘nuff”
version
• Shared Infrastructure Preservation
– Push content to sites that might preserve it
• arXiv.org, IA, WebCite…
• Server Enhanced Preservation
– Create archival-ready resources
And Soon…
• Social Preservation
– Preserving resources using 3rd
party Web Services
– Repository for OAI-ORE ReMs
– Social network feel
– Lazy-esque, server-side reconstruction
But I digress…
• Few years away…
• Preliminary research
• And now back to the prior research…
Web Infrastructure (McCown, 2007)
WayBack Machine
http://web.archive.org/web/*/http://www.thecribs.com/
http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/
from these we can
create time-based:
• indexes
• IDF values
• PageRank
Batch Recovery For Sites
http://warrick.cs.odu.edu/
Free limo rides for life?!
Reconstruction Diagram
• identical: 50%
• changed: 33%
• missing: 17%
• added: 20%
Real-Time Recovery for URIs
Synchronicity - www.cs.odu.edu/~mklein/
Memento wants to make navigating the
Web’s Past Easy
http://www.mementoweb.org
http://groups.google.com/group/memento-dev
What are you talking about?
• Uniform Resource Identifier (URI) ~= URL
• Resource:
– <HTML>
• Representation
W3C Web Architecture: Resource – URI – Representation
• URI identifies Resource
• Representation represents Resource
• Dereferencing the URI yields the Representation
W3C Web Architecture: Resource – URI – Representation (with content negotiation)
• URI identifies Resource
• Representation 1 represents Resource
• Representation 2 represents Resource
• Dereferencing the URI with content negotiation selects one Representation
Resources
Resources have Representations
Resources have Representations that
Change over Time
Only the Current Representation is
Available from a Resource
Old Representations are Lost
Forever
Finding Archived Resources
Go to http://www.archive.org/ and search
http://cnn.com
On http://web.archive.org/web/*/http://cnn.com, select
desired datetime
Archived Resources
http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
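The Internet Archive URI above carries its capture time as a 14-digit timestamp (YYYYMMDDhhmmss) in the path. A minimal Python sketch of that convention follows; the function names `wayback_uri` and `wayback_datetime` are made up for illustration, and the code assumes timezone-aware datetimes and only the URI pattern visible on this slide.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "http://web.archive.org/web/"

def wayback_uri(original_uri: str, when: datetime) -> str:
    """Build an archive URI for `original_uri` at the aware datetime `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_uri}"

def wayback_datetime(archived_uri: str) -> datetime:
    """Recover the capture datetime from the 14-digit timestamp in the URI."""
    stamp = archived_uri[len(WAYBACK_PREFIX):].split("/", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
```

For example, `wayback_datetime("http://web.archive.org/web/20010911203610/http://www.cnn.com/")` gives Sep 11 2001, 20:36:10 UTC, matching the datetime shown for the cnn.com capture.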
Navigating Archived Resources
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
"Pentagon" link: http://en.wikipedia.org/wiki/The_Pentagon (current)
Current and Past Web are Not
Integrated
• Current and Past Web are based on the same technology.
• But going from Current to Past Web is a matter of (manual) discovery.
• Memento wants to make going from Current to Past Web an (HTTP) protocol matter.
• Memento wants to integrate Current and Past Web.
One Memento HTTP Navigation
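Memento HTTP Flow
1. Client to URI-R: HEAD R, Accept-Datetime
2. URI-R to client: Link G
3. Client to URI-G: GET G, Accept-Datetime
4. URI-G to client: 302 M, Vary, TCN, Link R,B,M
5. Client to URI-M: GET M, Accept-Datetime
6. URI-M to client: 200, Content-Datetime, Link R,B,M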
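The flow above can be sketched as a small client. This is an illustrative Python outline, not the Memento reference implementation: `memento_navigate`, `find_timegate`, and the injected `fetch` callable (method, URI, request headers in; status code and response headers out) are all hypothetical names, and the network layer is left to the caller.

```python
import re
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]
Fetch = Callable[[str, str, Dict[str, str]], Response]

def find_timegate(link_header: str) -> str:
    """Extract the URI whose rel is "timegate" from a Link header value."""
    match = re.search(r'<([^>]+)>\s*;\s*rel="timegate"', link_header)
    if match is None:
        raise ValueError("no timegate link")
    return match.group(1)

def memento_navigate(fetch: Fetch, uri_r: str, accept_datetime: str) -> str:
    """Follow URI-R -> URI-G -> URI-M and return the URI-M."""
    headers = {"Accept-Datetime": accept_datetime}
    # Step 1: ask the original resource; it answers with a Link to its TimeGate.
    _, r_headers = fetch("HEAD", uri_r, headers)
    uri_g = find_timegate(r_headers["Link"])
    # Step 2: ask the TimeGate; it redirects (302) to the chosen Memento.
    status, g_headers = fetch("GET", uri_g, headers)
    if status != 302:
        raise RuntimeError(f"expected 302 from TimeGate, got {status}")
    uri_m = g_headers["Location"]
    # Step 3: fetch the Memento itself.
    status, _ = fetch("GET", uri_m, headers)
    if status != 200:
        raise RuntimeError(f"expected 200 from Memento, got {status}")
    return uri_m
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise with canned responses.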
One Memento HTTP Navigation
Scenario
• cnn.com includes Link to TimeGate at Internet Archive
• URI-R on one server, URI-G & URI-M on another
Memento HTTP Flow: URI-R
HEAD R, Accept-Datetime
HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
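The Accept-Datetime value in this request is an ordinary HTTP date. As a small sketch, it can be produced from a timezone-aware datetime with Python's standard library; the helper name `accept_datetime_header` is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date ending in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)

requested = datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc)
print(accept_datetime_header(requested))  # Tue, 11 Sep 2001 20:35:00 GMT
```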
Memento HTTP Flow: Success – URI-R
Link G
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1
Memento HTTP Flow: URI-G
GET G, Accept-Datetime
GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-G
302 M, Vary, Link R,B,M
HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
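A client that wants the first, last, previous, or next Memento has to pull the individual links out of the Link header shown above. The Python sketch below is one way to do that; `parse_links` is a hypothetical helper, and it assumes every parameter value is wrapped in straight ASCII double quotes. Splitting on commas would not work, because the datetime values themselves contain commas.

```python
import re
from typing import Dict, List

# One link: <uri> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_links(link_header: str) -> List[Dict[str, str]]:
    """Split a Link header value into dicts with 'uri' plus each quoted parameter."""
    links = []
    for uri, params in LINK_RE.findall(link_header):
        entry = {"uri": uri}
        entry.update(PARAM_RE.findall(params))
        links.append(entry)
    return links
```

Each returned entry can then be filtered by its `rel` value, for example to find the link whose `rel` is `last-memento` and read its `datetime`.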
Memento HTTP Flow: URI-M
GET M, Accept-Datetime
GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-M
200, Content-Datetime, Link R,B,M
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
  <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
  <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
  <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
  <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
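The Memento returned is the archive's best match, not necessarily an exact hit for the requested time. A short Python sketch of how a client might measure the gap between the Accept-Datetime it sent and the Content-Datetime it received; `memento_offset` is a hypothetical helper name.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

def memento_offset(accept_datetime: str, content_datetime: str) -> timedelta:
    """Return how far the Memento's datetime is from the requested one."""
    requested = parsedate_to_datetime(accept_datetime)
    archived = parsedate_to_datetime(content_datetime)
    return archived - requested
```

With the values from the exchange above (20:35:00 requested, 20:36:10 returned), the offset is 70 seconds.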
What does it all mean?
• Cutting edge technology
• Existing Infrastructure
• Redefining Web surfing
• MAJOR “real world” implications
Closing Thoughts
Preservation not for
privileged priesthood
http://doi.acm.org/10.1145/1592761.1592794
http://booktwo.org/notebook/wikipedia-historiography/
no more hoary stories
about format obsolescence:
http://blog.dshr.org/2010/09/reinforcing-my-point.html
Don't desiccate resources;
leave them on the web
Endless metadata is not
preservation…
archiving as branded service,
not infrastructure
http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
Acknowledgements
• Slides borrowed from:
• Dr. Michael L. Nelson:
– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
– http://www.slideshare.net/phonedude/review-of-web-archiving
– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
• Martin Klein:
– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages
