Your SlideShare is downloading. ×
0
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Digital Preservation - ODU
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Digital Preservation - ODU

804

Published on

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
804
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Digital Preservation Research at Old Dominion University Justin F. Brunelle The MITRE Corporation Old Dominion University (And hopefully MITRE, soon)
  • 2. Why are we listening? • Overview of the problem • BRIEF introduction to ODU WSDL group research • Memento • I’ll be skipping around, so don’t hesitate to interrupt me
  • 3. Digital Preservation • Using the past Web – Focus of our research • Temporal Browsing – Sessions in the past • Recovering Lost Pages – Is it really gone? • 404s – How to fix broken links?
  • 4. 1 same URI maps to same or very similar content at a later time 2 same URI maps to different content at a later time 3 different URI maps to same or very similar content at the same or at a later time 4 the content can not be found at any URI U1 C1 U1 C1 timeA B U1 C2 U1 C1 timeA B U2 C1 U1 C1 U1 404 timeA B U1 ?? U1 C1 timeA B Change on the Web
  • 5. Time to Talk About Saving Everything? Dinner for one or two costs more than 1TB disk Wikis have popularized versioning Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.: http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg Also related projects with cool URI / permalink focus: http://www.citability.org/ http://data.gov/ http://data.gov.uk/
  • 6. Fortress Model • Get a lot of money • Buy lots of storage • Hire lots of people • “Look upon my archive ye Mighty, and despair!”
  • 7. Alternate Methods • Lazy Preservation (McCown) – “How much preservation do I get if I do absolutely nothing?” • Just-In-Time Preservation (Klein) – Wait for it to disappear, then find a “good ‘nuff” version • Shared Infrastructure Preservation – Push content to sites that might preserve it • arXiv.org, IA, WebCite… • Server Enhanced Preservation – Create archival-ready resources
  • 8. And Soon… • Social Preservation – Preserving resources using 3rd party Web Services – Repository for OAI-ORE ReMs – Social network feel – Lazy-esque, server-side reconstruction
  • 9. But I digress… • Few years away… • Preliminary research • And now back to the prior research…
  • 10. Web Infrastructure (McCown, 2007)
  • 11. WayBack Machine http://web.archive.org/web/*/http://www.thecribs.com/ http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/ from these we can create time-based: • indexes • IDF values • PageRank
  • 12. Batch Recovery For Sites http://warrick.cs.odu.edu/ Free limo rides for life?!
  • 13. 13 Reconstruction Diagram added 20% identical 50% changed 33% missing 17%
  • 14. Real-Time Recovery for URIs Synchronicity - www.cs.odu.edu/~mklein/
  • 15. Memento wants to make navigating the Web’s Past Easy 15 http://www.mementoweb.org http://groups.google.com/group/memento-dev
  • 16. What are you talking about? • Universal Resource Identifier (URI) ~= URL • Resource: – <HTML> • Representation
  • 17. W3C Web Architecture: Resource – URI - Representation Resource Representation Represents URI Identifies dereference 17
  • 18. dereference content negotiation W3C Web Architecture: Resource – URI - Representation Resource URI Identifies Representation 1 Represents Representation 2Represents 18
  • 19. Resources 19
  • 20. Resources have Representations 20
  • 21. Resources have Representations that Change over Time 21
  • 22. Only the Current Representation is Available from a Resource 22
  • 23. Old Representations are Lost Forever 23
  • 24. Finding Archived Resources Go to http://www.archive.org/ and search http://cnn.com On http://web.archive.org/web/*/http://cnn.com, select desired datetime 24
  • 25. Archived Resources http://web.archive.org/web/20010911203610/http://www.c nn.com/ archived resource for http://cnn.com http://en.wikipedia.org/w/index.php? title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks Sep 11 2001, 20:36:10 UTC Dec 20 2001, 4:51:00 UTC 25
  • 26. Navigating Archived Resources http://en.wikipedia.org/w/index.php? title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks3 Dec 20 2001, 4:51:00 UTC http://en.wikipedia.org/wiki/The_Pentagon current Pentagon 26
  • 27. Current and Past Web are Not Integrated 27 • Current and Past Web based on same technology. • But, going from Current to Past Web is a matter of (manual) discovery. • Memento wants to make going from Current to Past Web a (HTTP) protocol matter. • Memento wants to integrate Current And Past Web.
  • 28. One Memento HTTP Navigation 28
  • 29. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 30. One Memento HTTP Navigation 30 Scenario • cnn.com includes Link to TimeGate at Internet Archive • URI-R on one server, URI-G & URI-M on another
  • 31. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 32. Memento HTTP Flow: URI-R HEAD R, Accept-Datetime HEAD http://cnn.com/ HTTP/1.1 Host: cnn.com Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT Connection: close 32
  • 33. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 34. Memento HTTP Flow: Success – URI-R LinkG HTTP/1.1 200 OK Date: Thu, 21 Jan 2010 00:02:12 GMT Server: Apache Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate" Content-Length: 255 Connection: close Content-Type: text/html; charset=iso-8859-1 34
  • 35. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 36. GET G, Accept-Datetime Memento HTTP Flow: URI-G GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1 Host: web.archive.org Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT Connection: close 36
  • 37. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 38. Memento HTTP Flow: Success – URI-G 302M, Vary, LinkR,B,M HTTP/1.1 302 Found Date: Thu, 21 Jan 2010 00:06:50 GMT Server: Apache TCN: choice Vary: negotiate, accept-datetime Location: http://web.archive.org/web/20010911203610/http://www.cnn.com Link: <http://cnn.com/>; rel="original", <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle”, <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel=“first- memento”; datetime=“Tue, 15 Sep 2000 11:28:26 GMT”, <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel=“last-memento”; datetime="Tue, 08 Jul 2008 09:34:33 GMT”, <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel=“prev-memento”; datetime="Tue, 11 Sep 2001 20:30:51 GMT”, <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel=“next-memento”; datetime="Tue, 11 Sep 2001 20:47:33 GMT” Content-Length: 0 Connection: close Content-Type: text/plain; charset=UTF-8 38
  • 39. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 40. GET M, Accept-Datetime Memento HTTP Flow: URI-M GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1 Host: web.archive.org Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT Connection: close 40
  • 41. Memento HTTP Flow HEAD R, Accept-Datetime LinkG 302M, Vary, TCN, LinkR,B,M 200, Content-Datetime, LinkR,B,M GET G, Accept-Datetime GET M, Accept-Datetime
  • 42. Memento HTTP Flow: Success – URI-M 200, Content-Datetime, LinkR,B,M HTTP/1.1 200 OK Server: Apache-Coyote/1.1 X-Archive-Orig-Accept-Ranges: bytes … Content-Type: text/html;charset=utf-8 Content-Length: 23364 Date: Thu, 21 Jan 2010 00:09:40 GMT Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT Link: <http://cnn.com/>; rel="original", <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle”, <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel=“first-memento”; datetime=“Tue, 15 Sep 2000 11:28:26 GMT”, <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel=“last-memento”; datetime="Tue, 08 Jul 2008 09:34:33 GMT”, <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel=“prev-memento”; datetime="Tue, 11 Sep 2001 20:30:51 GMT”, <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel=“next-memento”; datetime="Tue, 11 Sep 2001 20:47:33 GMT” Connection: close
  • 43. What does it all mean? • Cutting edge technology • Existing Infrastructure • Redefining Web surfing • MAJOR “real world” implications
  • 44. Closing Thoughts Preservation not for privileged priesthood http://doi.acm.org/10.1145/1592761.1592794 http://booktwo.org/notebook/wikipedia-historiography/ no more hoary stories about format obsolescence: http://blog.dshr.org/2010/09/reinforcing-my-point.html Don't dessicate resources; leave them on the web Endless metadata is not preservation… archiving as branded service, not infrastructure http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
  • 45. Acknowledgements • Slides borrowed from: • Dr. Michael L. Nelson: – http://www.slideshare.net/phonedude/my-point-of-view- michael-l-nelson-web-archiving-cooperative – http://www.slideshare.net/phonedude/review-of-web- archiving – http://www.slideshare.net/phonedude/memento-time- travel-for-the-web • Martin Klein: – http://www.slideshare.net/phonedude/synchronicity- justintime-discovery-of-lost-web-pages

×