Digital Preservation Research
at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
Digital Preservation - ODU


This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

Published in: Education

Transcript of "Digital Preservation - ODU"

  1. Digital Preservation Research at Old Dominion University • Justin F. Brunelle • The MITRE Corporation • Old Dominion University (And hopefully MITRE, soon)
  2. Why are we listening? • Overview of the problem • BRIEF introduction to ODU WSDL group research • Memento • I’ll be skipping around, so don’t hesitate to interrupt me
  3. Digital Preservation • Using the past Web – Focus of our research • Temporal Browsing – Sessions in the past • Recovering Lost Pages – Is it really gone? • 404s – How to fix broken links?
  4. Change on the Web, four cases between time A and a later time B: (1) same URI maps to same or very similar content at a later time (U1 → C1, then U1 → C1); (2) same URI maps to different content at a later time (U1 → C1, then U1 → C2); (3) different URI maps to same or very similar content at the same or at a later time (U1 → C1, then U1 → 404 and U2 → C1); (4) the content can not be found at any URI (U1 → C1, then U1 → ??)
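The four cases on this slide can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name `classify_change` and the dictionary-based "later Web" are hypothetical, and exact equality stands in for the slide's "same or very similar" test, which in practice would need a similarity measure.

```python
def classify_change(uri, content, later):
    """Return the slide's case number (1-4) for one resource.

    uri, content: what the URI mapped to at time A.
    later: dict of URI -> content observed at time B (hypothetical stand-in
           for the later Web; a URI that now returns 404 is simply absent).
    Cases are tested in slide order, so the first matching case wins.
    """
    if later.get(uri) == content:
        return 1  # same URI, same content (exact match assumed here)
    if uri in later:
        return 2  # same URI, different content
    if content in later.values():
        return 3  # content now lives at a different URI
    return 4      # content cannot be found at any URI


if __name__ == "__main__":
    print(classify_change("U1", "C1", {"U2": "C1"}))
```

The ordering of the checks is a design choice: a URI that changed its content while the old content moved elsewhere would be reported as case 2 rather than case 3.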
  5. Time to Talk About Saving Everything? Dinner for one or two costs more than 1TB disk Wikis have popularized versioning Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.: http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg Also related projects with cool URI / permalink focus: http://www.citability.org/ http://data.gov/ http://data.gov.uk/
  6. Fortress Model • Get a lot of money • Buy lots of storage • Hire lots of people • “Look upon my archive ye Mighty, and despair!”
  7. Alternate Methods • Lazy Preservation (McCown) – “How much preservation do I get if I do absolutely nothing?” • Just-In-Time Preservation (Klein) – Wait for it to disappear, then find a “good ‘nuff” version • Shared Infrastructure Preservation – Push content to sites that might preserve it • arXiv.org, IA, WebCite… • Server Enhanced Preservation – Create archival-ready resources
  8. And Soon… • Social Preservation – Preserving resources using 3rd party Web Services – Repository for OAI-ORE ReMs – Social network feel – Lazy-esque, server-side reconstruction
  9. But I digress… • Few years away… • Preliminary research • And now back to the prior research…
  10. Web Infrastructure (McCown, 2007)
  11. WayBack Machine http://web.archive.org/web/*/http://www.thecribs.com/ http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/ from these we can create time-based: • indexes • IDF values • PageRank
  12. Batch Recovery For Sites http://warrick.cs.odu.edu/ Free limo rides for life?!
  13. Reconstruction Diagram: added 20%, identical 50%, changed 33%, missing 17%
  14. Real-Time Recovery for URIs Synchronicity - www.cs.odu.edu/~mklein/
  15. Memento wants to make navigating the Web’s Past Easy http://www.mementoweb.org http://groups.google.com/group/memento-dev
  16. What are you talking about? • Uniform Resource Identifier (URI) ~= URL • Resource: – <HTML> • Representation
  17. W3C Web Architecture: Resource – URI – Representation: a URI identifies a Resource; a Representation represents the Resource; dereferencing the URI yields the Representation
  18. W3C Web Architecture: Resource – URI – Representation, with content negotiation: a URI identifies a Resource; Representation 1 and Representation 2 both represent the Resource; dereferencing the URI with content negotiation selects one of them
  19. Resources
  20. Resources have Representations
  21. Resources have Representations that Change over Time
  22. Only the Current Representation is Available from a Resource
  23. Old Representations are Lost Forever
  24. Finding Archived Resources • Go to http://www.archive.org/ and search http://cnn.com • On http://web.archive.org/web/*/http://cnn.com, select desired datetime
  25. Archived Resources • http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC) • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
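The Internet Archive URI on this slide carries its capture time as a 14-digit `YYYYMMDDhhmmss` path segment (20010911203610 corresponds to the slide's "Sep 11 2001, 20:36:10 UTC"). A minimal sketch of pulling that datetime and the original URI back out follows; the helper name `parse_wayback_uri` is hypothetical, and the pattern only covers the URI shape shown on the slide.

```python
import re
from datetime import datetime, timezone

# Matches only the shape shown on the slide:
# http://web.archive.org/web/<14 digits>/<original URI>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri_m):
    """Split an archived URI into (capture datetime in UTC, original URI)."""
    match = WAYBACK_RE.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognised archived URI: {uri_m!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)


if __name__ == "__main__":
    when, original = parse_wayback_uri(
        "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
    )
    print(when.isoformat(), original)
```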
  26. Navigating Archived Resources • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC) • http://en.wikipedia.org/wiki/The_Pentagon current Pentagon
  27. Current and Past Web are Not Integrated • Current and Past Web based on same technology. • But, going from Current to Past Web is a matter of (manual) discovery. • Memento wants to make going from Current to Past Web a (HTTP) protocol matter. • Memento wants to integrate Current And Past Web.
  28. One Memento HTTP Navigation
  29. Memento HTTP Flow • HEAD R, Accept-Datetime → Link G • GET G, Accept-Datetime → 302 M, Vary, TCN, Link R, B, M • GET M, Accept-Datetime → 200, Content-Datetime, Link R, B, M
  30. One Memento HTTP Navigation: Scenario • cnn.com includes Link to TimeGate at Internet Archive • URI-R on one server, URI-G & URI-M on another
  32. Memento HTTP Flow: URI-R (HEAD R, Accept-Datetime)
      HEAD http://cnn.com/ HTTP/1.1
      Host: cnn.com
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
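As a rough sketch of how a client might assemble this first request, the snippet below formats a datetime in the HTTP date style used on the slide and attaches it as an `Accept-Datetime` header. The function name `build_head_request` is hypothetical, and the request is only built, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_head_request(uri_r, when):
    """Build (but do not send) a HEAD request for URI-R with Accept-Datetime.

    `when` must be timezone-aware; it is converted to UTC and rendered in the
    "Tue, 11 Sep 2001 20:35:00 GMT" style shown on the slide.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    headers = {"Accept-Datetime": accept_datetime, "Connection": "close"}
    return Request(uri_r, method="HEAD", headers=headers)


if __name__ == "__main__":
    req = build_head_request(
        "http://cnn.com/", datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc)
    )
    # urllib stores header names capitalised as "Accept-datetime";
    # HTTP header names are case-insensitive, so this is equivalent on the wire.
    print(req.get_method(), req.full_url, req.get_header("Accept-datetime"))
```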
  34. Memento HTTP Flow: Success – URI-R (Link G)
      HTTP/1.1 200 OK
      Date: Thu, 21 Jan 2010 00:02:12 GMT
      Server: Apache
      Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
      Content-Length: 255
      Connection: close
      Content-Type: text/html; charset=iso-8859-1
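To act on this response, a client has to find the target whose `rel` is `timegate` in the `Link` header. The following is a minimal regex-based sketch, not a complete Web Linking parser; the names `parse_link_header` and `find_rel` are hypothetical. Quoted parameter values may contain commas (as the `datetime` values later in the flow do), so the header is not simply split on commas.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_link_header(value):
    """Return a list of (target URI, {parameter: value}) pairs."""
    links = []
    for target, params in _LINK_RE.findall(value):
        attrs = {}
        for name, quoted, bare in _PARAM_RE.findall(params):
            attrs[name.lower()] = quoted or bare
        links.append((target, attrs))
    return links


def find_rel(value, rel):
    """Return the first target whose rel list contains `rel`, else None."""
    for target, attrs in parse_link_header(value):
        if rel in attrs.get("rel", "").split():
            return target
    return None


if __name__ == "__main__":
    header = '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'
    print(find_rel(header, "timegate"))
```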
  36. Memento HTTP Flow: URI-G (GET G, Accept-Datetime)
      GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
      Host: web.archive.org
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
  38. Memento HTTP Flow: Success – URI-G (302 M, Vary, Link R, B, M)
      HTTP/1.1 302 Found
      Date: Thu, 21 Jan 2010 00:06:50 GMT
      Server: Apache
      TCN: choice
      Vary: negotiate, accept-datetime
      Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
      Link: <http://cnn.com/>; rel="original",
        <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
        <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
        <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
      Content-Length: 0
      Connection: close
      Content-Type: text/plain; charset=UTF-8
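The slide does not say how the TimeGate chooses which Memento to redirect to. One plausible rule, consistent with this example, is to pick the capture closest in time to the requested `Accept-Datetime`. The sketch below assumes that rule; the function name `pick_memento` and the short labels standing in for URI-Ms are hypothetical.

```python
from datetime import datetime, timezone


def pick_memento(requested, mementos):
    """Return the key of the capture closest in time to `requested`.

    mementos: dict of URI-M (or any label) -> capture datetime.
    Closest-in-time is an assumed rule, not taken from the slides;
    on a tie, the entry that appears first in the dict wins.
    """
    if not mementos:
        raise ValueError("no mementos to choose from")
    return min(mementos, key=lambda uri: abs(mementos[uri] - requested))


if __name__ == "__main__":
    utc = timezone.utc
    # Capture times taken from the Link header on the slide; labels are made up.
    captures = {
        "capture-20:30:51": datetime(2001, 9, 11, 20, 30, 51, tzinfo=utc),
        "capture-20:36:10": datetime(2001, 9, 11, 20, 36, 10, tzinfo=utc),
        "capture-20:47:33": datetime(2001, 9, 11, 20, 47, 33, tzinfo=utc),
    }
    wanted = datetime(2001, 9, 11, 20, 35, 0, tzinfo=utc)
    print(pick_memento(wanted, captures))
```

With the slide's request time of 20:35:00, the 20:36:10 capture is 70 seconds away and the 20:30:51 capture is 249 seconds away, which matches the `Location` target in the response.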
  40. Memento HTTP Flow: URI-M (GET M, Accept-Datetime)
      GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
      Host: web.archive.org
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
  42. Memento HTTP Flow: Success – URI-M (200, Content-Datetime, Link R, B, M)
      HTTP/1.1 200 OK
      Server: Apache-Coyote/1.1
      X-Archive-Orig-Accept-Ranges: bytes
      …
      Content-Type: text/html;charset=utf-8
      Content-Length: 23364
      Date: Thu, 21 Jan 2010 00:09:40 GMT
      Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
      Link: <http://cnn.com/>; rel="original",
        <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
        <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
        <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
      Connection: close
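A client may want to know how far the returned Memento is from the datetime it asked for. A small sketch, assuming both values use the HTTP date format shown on the slides; the function name `datetime_offset_seconds` is hypothetical. The header name `Content-Datetime` is the one on this slide; the later Memento specification (RFC 7089) calls it `Memento-Datetime`.

```python
from email.utils import parsedate_to_datetime


def datetime_offset_seconds(accept_datetime, content_datetime):
    """Seconds from the requested datetime to the Memento's datetime.

    Positive means the Memento was captured after the requested moment.
    Both arguments are header values such as "Tue, 11 Sep 2001 20:35:00 GMT".
    """
    requested = parsedate_to_datetime(accept_datetime)
    archived = parsedate_to_datetime(content_datetime)
    return (archived - requested).total_seconds()


if __name__ == "__main__":
    print(
        datetime_offset_seconds(
            "Tue, 11 Sep 2001 20:35:00 GMT",  # Accept-Datetime sent by the client
            "Tue, 11 Sep 2001 20:36:10 GMT",  # Content-Datetime in the response
        )
    )
```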
  43. What does it all mean? • Cutting edge technology • Existing Infrastructure • Redefining Web surfing • MAJOR “real world” implications
  44. Closing Thoughts Preservation not for privileged priesthood http://doi.acm.org/10.1145/1592761.1592794 http://booktwo.org/notebook/wikipedia-historiography/ no more hoary stories about format obsolescence: http://blog.dshr.org/2010/09/reinforcing-my-point.html Don't desiccate resources; leave them on the web Endless metadata is not preservation… archiving as branded service, not infrastructure http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
  45. Acknowledgements • Slides borrowed from: • Dr. Michael L. Nelson: – http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative – http://www.slideshare.net/phonedude/review-of-web-archiving – http://www.slideshare.net/phonedude/memento-time-travel-for-the-web • Martin Klein: – http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages