Digital Preservation - ODU

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

Digital Preservation - ODU Presentation Transcript

  • 1. Digital Preservation Research at Old Dominion University
    • Justin F. Brunelle
    • The MITRE Corporation
    • Old Dominion University (And hopefully MITRE, soon)
  • 2. Why are we listening?
    • Overview of the problem
    • BRIEF introduction to ODU WSDL group research
    • Memento
    • I’ll be skipping around, so don’t hesitate to interrupt me
  • 3. Digital Preservation
    • Using the past Web
      • Focus of our research
    • Temporal Browsing
      • Sessions in the past
    • Recovering Lost Pages
      • Is it really gone?
    • 404s
      • How to fix broken links?
  • 4. Change on the Web
    • Case 1: same URI maps to same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)
    • Case 2: same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)
    • Case 3: different URI maps to same or very similar content at the same or at a later time (time A: U1 → C1; time B: U1 → 404, U2 → C1)
    • Case 4: the content cannot be found at any URI (time A: U1 → C1; time B: U1 → ??)
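The four cases on slide 4 can be sketched as a small classifier. This is only an illustration, not code from the talk: the `Snapshot` type and `classify_change` function are hypothetical names, exact string equality stands in for "same or very similar content", and case 3 is only checked once the original URI has stopped resolving.

```python
from typing import Dict, Optional

# A snapshot maps each URI to the content observed there; None stands for a 404.
# (Hypothetical structure, introduced only for this sketch.)
Snapshot = Dict[str, Optional[str]]


def classify_change(uri: str, at_a: Snapshot, at_b: Snapshot) -> int:
    """Return the slide's case number (1-4) for `uri` between time A and time B."""
    original = at_a[uri]
    later = at_b.get(uri)
    if later is not None:
        # The URI still resolves: case 1 if the content is unchanged, else case 2.
        return 1 if later == original else 2
    # The URI no longer resolves: look for the same content under a different URI.
    moved = any(
        content == original for other, content in at_b.items() if other != uri
    )
    return 3 if moved else 4
```

A real comparison would need a similarity measure rather than equality, since the slide allows "very similar" content to count as the same.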
  • 5. Time to Talk About Saving Everything?
    • Dinner for one or two costs more than a 1TB disk
    • Wikis have popularized versioning
    • Cool URIs ( http://www.w3.org/Provider/Style/URI.html ) are widely adopted, e.g.:
      • http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
      • http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
      • http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
    • Also related projects with cool URI / permalink focus:
      • http://www.citability.org/
      • http://data.gov/
      • http://data.gov.uk/
  • 6. Fortress Model
    • Get a lot of money
    • Buy lots of storage
    • Hire lots of people
    • “Look upon my archive, ye Mighty, and despair!”
  • 7. Alternate Methods
    • Lazy Preservation (McCown)
      • “How much preservation do I get if I do absolutely nothing?”
    • Just-In-Time Preservation (Klein)
      • Wait for it to disappear, then find a “good ‘nuff” version
    • Shared Infrastructure Preservation
      • Push content to sites that might preserve it
        • arXiv.org, IA, WebCite…
    • Server Enhanced Preservation
      • Create archival-ready resources
  • 8. And Soon…
    • Social Preservation
      • Preserving resources using 3rd-party Web Services
      • Repository for OAI-ORE ReMs
      • Social network feel
      • Lazy-esque, server-side reconstruction
  • 9. But I digress…
    • Few years away…
    • Preliminary research
    • And now back to the prior research…
  • 10. Web Infrastructure (McCown, 2007)
  • 11. WayBack Machine
    • http://web.archive.org/web/*/http://www.thecribs.com/
    • http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/
    • From these we can create time-based:
      • indexes
      • IDF values
      • PageRank
  • 12. Batch Recovery For Sites http://warrick.cs.odu.edu/ Free limo rides for life?!
  • 13. Reconstruction Diagram (added 20%, identical 50%, changed 33%, missing 17%)
  • 14. Real-Time Recovery for URIs Synchronicity - www.cs.odu.edu/~mklein/
  • 15. Memento wants to make navigating the Web’s Past Easy
      • http://www.mementoweb.org
      • http://groups.google.com/group/memento-dev
  • 16. What are you talking about?
    • Uniform Resource Identifier (URI) ~= URL
    • Resource:
      • <HTML>
    • Representation
  • 17. W3C Web Architecture: Resource – URI – Representation
    • URI identifies Resource
    • Dereferencing the URI yields a Representation
    • Representation represents Resource
  • 18. W3C Web Architecture: Resource – URI – Representation
    • URI identifies Resource
    • Dereferencing the URI with content negotiation yields Representation 1 or Representation 2
    • Representation 1 and Representation 2 each represent Resource
  • 19. Resources
  • 20. Resources have Representations
  • 21. Resources have Representations that Change over Time
  • 22. Only the Current Representation is Available from a Resource
  • 23. Old Representations are Lost Forever
  • 24. Finding Archived Resources
    • Go to http://www.archive.org/ and search http://cnn.com
    • On http://web.archive.org/web/*/http://cnn.com , select desired datetime
  • 25. Archived Resources
    • http://web.archive.org/web/20010911203610/http://www.cnn.com/
      • archived resource for http://cnn.com
      • Sep 11 2001, 20:36:10 UTC
    • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
      • archived resource for http://en.wikipedia.org/wiki/September_11_attacks
      • Dec 20 2001, 4:51:00 UTC
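The Wayback Machine URI on slide 25 embeds the capture time as a 14-digit `YYYYMMDDhhmmss` stamp in front of the original URI. A minimal sketch of building and reading such URIs follows, assuming that pattern holds in general; `wayback_uri` and `capture_datetime` are hypothetical helper names, not part of any archive API.

```python
from datetime import datetime, timezone

# Prefix taken from the archived-resource URI shown on the slide.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_uri(original_uri: str, captured: datetime) -> str:
    """Build an archived-resource URI from an original URI and a capture time."""
    stamp = captured.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_uri}"


def capture_datetime(archived_uri: str) -> datetime:
    """Recover the capture time (UTC) embedded in an archived-resource URI."""
    if not archived_uri.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine URI of the assumed form")
    stamp = archived_uri[len(WAYBACK_PREFIX):].split("/", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
```

The Wikipedia example on the same slide uses a revision identifier (`oldid`) instead of a timestamp, so no such decoding applies there.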
  • 26. Navigating Archived Resources
    • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
      • archived resource for http://en.wikipedia.org/wiki/September_11_attacks
      • Dec 20 2001, 4:51:00 UTC
    • http://en.wikipedia.org/wiki/The_Pentagon
      • current Pentagon
  • 27. Current and Past Web are Not Integrated
    • Current and Past Web based on same technology.
    • But, going from Current to Past Web is a matter of (manual) discovery.
    • Memento wants to make going from Current to Past Web a (HTTP) protocol matter.
    • Memento wants to integrate Current And Past Web.
  • 28. One Memento HTTP Navigation
  • 29. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 30. One Memento HTTP Navigation
    • Scenario
    • cnn.com includes Link to TimeGate at Internet Archive
    • URI-R on one server, URI-G & URI-M on another
  • 31. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 32. Memento HTTP Flow: URI-R (HEAD R, Accept-Datetime)
      HEAD http://cnn.com/ HTTP/1.1
      Host: cnn.com
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
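The `Accept-Datetime` value on slide 32 uses the usual HTTP date format. As a minimal sketch, Python's standard library can produce it; the helper name `accept_datetime_header` is an assumption made for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the Accept-Datetime request header for a timezone-aware datetime."""
    # format_datetime(..., usegmt=True) requires a UTC datetime and emits "GMT".
    value = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": value}
```

Called with 11 Sep 2001 20:35:00 UTC, this yields the same header value as the request on the slide.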
  • 33. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 34. Memento HTTP Flow: Success – URI-R (Link → G)
      HTTP/1.1 200 OK
      Date: Thu, 21 Jan 2010 00:02:12 GMT
      Server: Apache
      Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
      Content-Length: 255
      Connection: close
      Content-Type: text/html; charset=iso-8859-1
  • 35. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 36. Memento HTTP Flow: URI-G (GET G, Accept-Datetime)
      GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
      Host: web.archive.org
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
  • 37. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 38. Memento HTTP Flow: Success – URI-G (302 → M, Vary, Link → R,B,M)
      HTTP/1.1 302 Found
      Date: Thu, 21 Jan 2010 00:06:50 GMT
      Server: Apache
      TCN: choice
      Vary: negotiate, accept-datetime
      Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
      Link: <http://cnn.com/>; rel="original",
        <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
        <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
        <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
      Content-Length: 0
      Connection: close
      Content-Type: text/plain; charset=UTF-8
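The `Link` header on slide 38 packs several relations into one value, and the commas inside the quoted `datetime` parameters rule out a naive split on commas. Below is a minimal parsing sketch, assuming every parameter is double-quoted and each link carries a single `rel` value as on the slide; `parse_link_header` is a hypothetical helper, not a Memento library function.

```python
import re
from typing import Dict

# One link: <uri> followed by zero or more ; name="value" parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
_PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(value: str) -> Dict[str, Dict[str, str]]:
    """Map each rel (e.g. "original", "first-memento") to its URI and parameters."""
    links: Dict[str, Dict[str, str]] = {}
    for uri, raw_params in _LINK.findall(value):
        params = dict(_PARAM.findall(raw_params))
        rel = params.pop("rel", None)
        if rel is not None:
            links[rel] = {"uri": uri, **params}
    return links
```

With the header above, `parse_link_header(...)["original"]["uri"]` would give the URI-R, and the `first-memento` and `last-memento` entries would carry their `datetime` strings.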
  • 39. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 40. Memento HTTP Flow: URI-M (GET M, Accept-Datetime)
      GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
      Host: web.archive.org
      Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
      Connection: close
  • 41. Memento HTTP Flow
    • HEAD R (Accept-Datetime) returns Link → G
    • GET G (Accept-Datetime) returns 302 → M, Vary, TCN, Link → R,B,M
    • GET M (Accept-Datetime) returns 200, Content-Datetime, Link → R,B,M
  • 42. Memento HTTP Flow: Success – URI-M (200, Content-Datetime, Link → R,B,M)
      HTTP/1.1 200 OK
      Server: Apache-Coyote/1.1
      X-Archive-Orig-Accept-Ranges: bytes
      …
      Content-Type: text/html;charset=utf-8
      Content-Length: 23364
      Date: Thu, 21 Jan 2010 00:09:40 GMT
      Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
      Link: <http://cnn.com/>; rel="original",
        <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
        <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
        <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
        <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
      Connection: close
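The three-step flow walked through on slides 32 to 42 can be sketched as one function. This is an illustration under stated assumptions, not the Memento reference client: the `send` callable is a hypothetical stand-in for an HTTP library, `find_memento` and `replay_slides` are invented names, and the header names are the ones shown on the slides.

```python
import re
from typing import Callable, Dict, Tuple

# (method, uri, request headers) -> (status code, response headers).
# Hypothetical transport signature so the sketch runs without network access.
Transport = Callable[[str, str, Dict[str, str]], Tuple[int, Dict[str, str]]]


def find_memento(uri_r: str, accept_datetime: str, send: Transport) -> Tuple[str, str]:
    """Follow URI-R -> URI-G -> URI-M; return (URI-M, Content-Datetime)."""
    headers = {"Accept-Datetime": accept_datetime}

    # Step 1: HEAD on the original resource; its Link header names the TimeGate.
    _, resp = send("HEAD", uri_r, headers)
    match = re.search(r'<([^>]*)>\s*;\s*rel="timegate"', resp.get("Link", ""))
    if match is None:
        raise LookupError("original resource advertises no TimeGate")
    uri_g = match.group(1)

    # Step 2: GET on the TimeGate; a 302 points at the closest Memento.
    status, resp = send("GET", uri_g, headers)
    if status != 302 or "Location" not in resp:
        raise LookupError("TimeGate did not redirect to a Memento")
    uri_m = resp["Location"]

    # Step 3: GET on the Memento; Content-Datetime states when it was archived.
    status, resp = send("GET", uri_m, headers)
    if status != 200:
        raise LookupError("Memento could not be retrieved")
    return uri_m, resp.get("Content-Datetime", "")


def replay_slides(method: str, uri: str, headers: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Canned transport that replays the responses shown on the slides."""
    timegate = "http://web.archive.org/web/timegate/http://cnn.com"
    memento = "http://web.archive.org/web/20010911203610/http://www.cnn.com"
    if uri == "http://cnn.com/":
        return 200, {"Link": f'<{timegate}>; rel="timegate"'}
    if uri == timegate:
        return 302, {"Location": memento}
    if uri == memento:
        return 200, {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}
    return 404, {}
```

Passing the transport in as a parameter keeps the negotiation logic separate from any particular HTTP client, so `find_memento("http://cnn.com/", "Tue, 11 Sep 2001 20:35:00 GMT", replay_slides)` reproduces the slide scenario offline.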
  • 43. What does it all mean?
    • Cutting edge technology
    • Existing Infrastructure
    • Redefining Web surfing
    • MAJOR “real world” implications
  • 44. Closing Thoughts
    • Preservation not for privileged priesthood
      • http://doi.acm.org/10.1145/1592761.1592794
      • http://booktwo.org/notebook/wikipedia-historiography/
    • No more hoary stories about format obsolescence:
      • http://blog.dshr.org/2010/09/reinforcing-my-point.html
    • Don't desiccate resources; leave them on the web
    • Endless metadata is not preservation… archiving as branded service, not infrastructure
      • http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
  • 45. Acknowledgements
    • Slides borrowed from:
    • Dr. Michael L. Nelson:
      • http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
      • http://www.slideshare.net/phonedude/review-of-web-archiving
      • http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
    • Martin Klein:
      • http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages