This presentation was given at the RRAC meeting on 2010-10-20. It summarizes the Digital Preservation research efforts at Old Dominion University.
1. Digital Preservation Research
at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
(And hopefully MITRE, soon)
2. Why are we listening?
• Overview of the problem
• BRIEF introduction to ODU WSDL group research
• Memento
• I’ll be skipping around, so don’t hesitate to interrupt me
3. Digital Preservation
• Using the past Web
– Focus of our research
• Temporal Browsing
– Sessions in the past
• Recovering Lost Pages
– Is it really gone?
• 404s
– How to fix broken links?
4. Change on the Web
• Case 1: the same URI maps to the same or very similar content at a later time
• Case 2: the same URI maps to different content at a later time
• Case 3: a different URI maps to the same or very similar content at the same or a later time
• Case 4: the content cannot be found at any URI (404)
[Diagram: (URI, content) pairs such as (U1, C1) observed at times A and B, illustrating each case.]
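The four change cases above can be sketched as a small classifier over (URI, content) observations at two times. This is an illustrative sketch only; the `Change` enum and `classify` function are hypothetical names, and real systems would use content similarity (e.g., lexical signatures) rather than strict equality:

```python
from enum import Enum

class Change(Enum):
    SAME_CONTENT = 1     # case 1: same URI, same (or very similar) content later
    CHANGED_CONTENT = 2  # case 2: same URI, different content later
    MOVED_CONTENT = 3    # case 3: different URI, same content
    LOST = 4             # case 4: content not found at any URI (404)

def classify(uri_a, content_a, uri_b, content_b):
    """Classify how an observation (uri_a, content_a) at time A relates
    to an observation (uri_b, content_b) at a later time B."""
    if content_b is None:
        return Change.LOST
    if uri_a == uri_b:
        return (Change.SAME_CONTENT if content_a == content_b
                else Change.CHANGED_CONTENT)
    # Different URI serving the same content: the page has moved.
    return Change.MOVED_CONTENT
```

For example, `classify("u1", "c1", "u2", "c1")` corresponds to the diagram's third case, where U2 now serves C1.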
5. Time to Talk About Saving Everything?
• Dinner for one or two costs more than a 1 TB disk
• Wikis have popularized versioning
• Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:
http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
• Also related projects with a cool URI / permalink focus:
http://www.citability.org/
http://data.gov/
http://data.gov.uk/
6. Fortress Model
• Get a lot of money
• Buy lots of storage
• Hire lots of people
• “Look upon my archive ye Mighty, and despair!”
7. Alternate Methods
• Lazy Preservation (McCown)
– “How much preservation do I get if I do absolutely nothing?”
• Just-In-Time Preservation (Klein)
– Wait for it to disappear, then find a “good ‘nuff” version
• Shared Infrastructure Preservation
– Push content to sites that might preserve it
– arXiv.org, IA, WebCite…
• Server Enhanced Preservation
– Create archival-ready resources
8. And Soon…
• Social Preservation
– Preserving resources using 3rd-party Web Services
– Repository for OAI-ORE ReMs
– Social network feel
– Lazy-esque, server-side reconstruction
9. But I digress…
• Few years away…
• Preliminary research
• And now back to the prior research…
24. Finding Archived Resources
Go to http://www.archive.org/ and search for http://cnn.com.
On http://web.archive.org/web/*/http://cnn.com, select the desired datetime.
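This manual lookup can also be scripted against the Wayback Machine's availability API (a real endpoint at https://archive.org/wayback/available, which returns the closest snapshot to a given timestamp). A minimal sketch; the helper names are hypothetical and the JSON shown in the usage note is a sample response, not live data:

```python
import json
from urllib.parse import urlencode

def wayback_query(url, timestamp):
    """Build an availability-API request URL asking for the snapshot
    of `url` closest to a YYYYMMDD[hhmmss] timestamp."""
    return ("https://archive.org/wayback/available?"
            + urlencode({"url": url, "timestamp": timestamp}))

def closest_snapshot(response_text):
    """Extract the closest archived snapshot's URI from an
    availability-API JSON response, or None if nothing is archived."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

Fetching `wayback_query("http://cnn.com", "20101020")` would yield JSON like `{"archived_snapshots": {"closest": {"available": true, "url": "http://web.archive.org/web/20101020000000/http://cnn.com/", ...}}}`, from which `closest_snapshot` pulls the archived URI.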
27. Current and Past Web are Not Integrated
• Current and Past Web are based on the same technology.
• But going from the Current to the Past Web is a matter of (manual) discovery.
• Memento wants to make going from the Current to the Past Web an (HTTP) protocol matter.
• Memento wants to integrate the Current and Past Web.
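Making time travel "an (HTTP) protocol matter" is what Memento (later standardized as RFC 7089) does: a client GETs a TimeGate for the original URI and sends an Accept-Datetime header; the TimeGate redirects to the memento closest to that datetime. A sketch of the client side, assuming the public Memento aggregator TimeGate at timetravel.mementoweb.org (the `memento_request` helper is a hypothetical name):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Format an Accept-Datetime value as an RFC 1123 GMT date,
    the form Memento datetime negotiation uses."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def memento_request(original_uri, dt,
                    timegate="http://timetravel.mementoweb.org/timegate/"):
    """Sketch the HTTP request a Memento client sends: a GET on a
    TimeGate URI formed from the original resource's URI, carrying
    Accept-Datetime for the desired moment in the past."""
    return ("GET " + timegate + original_uri + " HTTP/1.1",
            "Accept-Datetime: " + accept_datetime_header(dt))
```

The TimeGate would answer with a 302 to a memento and a Memento-Datetime header, so past versions are reached through ordinary HTTP rather than manual archive browsing.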
43. What does it all mean?
• Cutting edge technology
• Existing Infrastructure
• Redefining Web surfing
• MAJOR “real world” implications
44. Closing Thoughts
• Preservation is not for a privileged priesthood
http://doi.acm.org/10.1145/1592761.1592794
http://booktwo.org/notebook/wikipedia-historiography/
• No more hoary stories about format obsolescence:
http://blog.dshr.org/2010/09/reinforcing-my-point.html
• Don’t desiccate resources; leave them on the web
• Endless metadata is not preservation…
• Archiving as a branded service, not infrastructure
http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
45. Acknowledgements
• Slides borrowed from:
• Dr. Michael L. Nelson:
– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
– http://www.slideshare.net/phonedude/review-of-web-archiving
– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
• Martin Klein:
– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages