Digital Preservation - ODU

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

Digital Preservation - ODU Presentation Transcript

  • Digital Preservation Research at Old Dominion University
    • Justin F. Brunelle
    • The MITRE Corporation
    • Old Dominion University (And hopefully MITRE, soon)
  • Why are we listening?
    • Overview of the problem
    • BRIEF introduction to ODU WSDL group research
    • Memento
    • I’ll be skipping around, so don’t hesitate to interrupt me
  • Digital Preservation
    • Using the past Web
      • Focus of our research
    • Temporal Browsing
      • Sessions in the past
    • Recovering Lost Pages
      • Is it really gone?
    • 404s
      • How to fix broken links?
  • Change on the Web
    • 1: same URI maps to same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)
    • 2: same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)
    • 3: different URI maps to same or very similar content at the same or at a later time (time A: U1 → C1; time B: U1 → 404, U2 → C1)
    • 4: the content cannot be found at any URI (time A: U1 → C1; time B: U1 → ??)
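The four cases above can be written as a small decision procedure. This is only a sketch: the function name `classify_change`, the dictionary model of "the Web at the later time", and the use of exact string equality as a stand-in for "same or very similar content" are all assumptions made for illustration, not something the slides specify.

```python
from typing import Dict


def classify_change(uri: str, old_content: str, later_web: Dict[str, str]) -> int:
    """Return the slide's case number (1-4) for `uri`.

    `later_web` maps every URI that still resolves at the later time to its
    content; a URI absent from the mapping stands for a 404.
    Exact equality is a simplification of "same or very similar".
    """
    current = later_web.get(uri)
    if current == old_content:
        return 1  # same URI, same content
    if old_content in later_web.values():
        return 3  # the content now lives at a different URI
    if current is not None:
        return 2  # same URI, different content
    return 4      # the content cannot be found at any URI
```

A real system would replace the equality check with a similarity measure and a threshold, since archived copies are rarely byte-identical.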
  • Time to Talk About Saving Everything?
    • Dinner for one or two costs more than 1TB disk
    • Wikis have popularized versioning
    • Cool URIs ( http://www.w3.org/Provider/Style/URI.html ) are widely adopted, e.g.:
      • http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
      • http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
      • http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
    • Also related projects with cool URI / permalink focus:
      • http://www.citability.org/
      • http://data.gov/
      • http://data.gov.uk/
  • Fortress Model
    • Get a lot of money
    • Buy lots of storage
    • Hire lots of people
    • “Look upon my archive ye Mighty, and despair!”
  • Alternate Methods
    • Lazy Preservation (McCown)
      • “How much preservation do I get if I do absolutely nothing?”
    • Just-In-Time Preservation (Klein)
      • Wait for it to disappear, then find a “good ’nuff” version
    • Shared Infrastructure Preservation
      • Push content to sites that might preserve it
        • arXiv.org, IA, WebCite…
    • Server Enhanced Preservation
      • Create archival-ready resources
  • And Soon…
    • Social Preservation
      • Preserving resources using 3rd-party Web Services
      • Repository for OAI-ORE ReMs
      • Social network feel
      • Lazy-esque, server-side reconstruction
  • But I digress…
    • Few years away…
    • Preliminary research
    • And now back to the prior research…
  • Web Infrastructure (McCown, 2007)
  • WayBack Machine
    • http://web.archive.org/web/*/http://www.thecribs.com/
    • http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/
    • from these we can create time-based:
      • indexes
      • IDF values
      • PageRank
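To make "time-based IDF values" concrete, here is a rough sketch that computes inverse document frequency separately for each year of archived copies. The slide does not say how these values are actually derived; the function name `idf_by_year`, the (year, text) input shape, whitespace tokenization, and the textbook formula log(N / df) are all assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def idf_by_year(mementos: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, float]]:
    """Compute IDF per term, separately for each year's archived copies.

    `mementos` yields (year, text) pairs. For one year, IDF is log(N / df),
    where N is the number of copies from that year and df is the number of
    those copies that contain the term.
    """
    docs_by_year: Dict[int, List[Set[str]]] = defaultdict(list)
    for year, text in mementos:
        docs_by_year[year].append(set(text.lower().split()))

    result: Dict[int, Dict[str, float]] = {}
    for year, docs in docs_by_year.items():
        df: Dict[str, int] = defaultdict(int)
        for terms in docs:
            for term in terms:
                df[term] += 1
        result[year] = {term: math.log(len(docs) / n) for term, n in df.items()}
    return result
```

Slicing the corpus by time means a term's weight reflects how distinctive it was when the page was captured, rather than how distinctive it is today.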
  • Batch Recovery For Sites http://warrick.cs.odu.edu/ Free limo rides for life?!
  • Reconstruction Diagram: added 20%, identical 50%, changed 33%, missing 17%
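The four categories in the reconstruction diagram can be computed by comparing the original site with the recovered one. The sketch below is illustrative only: the name `reconstruction_summary`, the URI-to-content-hash dictionaries, and the choice to report "added" as a share of the recovered site (while the other three are shares of the original site) are assumptions, not a description of how Warrick reports its results.

```python
from typing import Dict


def reconstruction_summary(original: Dict[str, str],
                           recovered: Dict[str, str]) -> Dict[str, float]:
    """Classify resources by comparing URI -> content-hash maps.

    identical / changed / missing are fractions of the original site;
    added is a fraction of the recovered site (an assumption of this sketch).
    """
    identical = sum(1 for uri, h in original.items() if recovered.get(uri) == h)
    changed = sum(1 for uri, h in original.items()
                  if uri in recovered and recovered[uri] != h)
    missing = sum(1 for uri in original if uri not in recovered)
    added = sum(1 for uri in recovered if uri not in original)

    n_original = len(original) or 1    # avoid division by zero on empty input
    n_recovered = len(recovered) or 1
    return {
        "identical": identical / n_original,
        "changed": changed / n_original,
        "missing": missing / n_original,
        "added": added / n_recovered,
    }
```

With this convention, identical, changed, and missing always sum to 1, which is consistent with the 50% + 33% + 17% on the slide.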
  • Real-Time Recovery for URIs Synchronicity - www.cs.odu.edu/~mklein/
  • Memento wants to make navigating the Web’s Past Easy
      • http://www.mementoweb.org
      • http://groups.google.com/group/memento-dev
  • What are you talking about?
    • Uniform Resource Identifier (URI) ~= URL
    • Resource:
      • <HTML>
    • Representation
  • W3C Web Architecture: Resource – URI – Representation
    • URI identifies Resource
    • Representation represents Resource
    • dereferencing the URI yields the Representation
  • W3C Web Architecture: Resource – URI – Representation (content negotiation)
    • URI identifies Resource
    • Representation 1 and Representation 2 each represent the Resource
    • dereferencing the URI with content negotiation selects one Representation
  • Resources
  • Resources have Representations
  • Resources have Representations that Change over Time
  • Only the Current Representation is Available from a Resource
  • Old Representations are Lost Forever
  • Finding Archived Resources
    • Go to http://www.archive.org/ and search http://cnn.com
    • On http://web.archive.org/web/*/http://cnn.com , select desired datetime
  • Archived Resources
    • http://web.archive.org/web/20010911203610/http://www.cnn.com/
      • archived resource for http://cnn.com
      • Sep 11 2001, 20:36:10 UTC
    • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
      • archived resource for http://en.wikipedia.org/wiki/September_11_attacks
      • Dec 20 2001, 4:51:00 UTC
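The Wayback Machine URI on this slide embeds both the capture time and the original URI, so the pair can be recovered mechanically. A minimal sketch, assuming the 14-digit `YYYYMMDDhhmmss` path segment seen in the slide's example is a UTC timestamp (as the slide's "UTC" label suggests); the helper name `parse_wayback_uri` is made up here.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Matches URIs shaped like the slide's example:
# http://web.archive.org/web/<14 digits>/<original URI>
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri: str) -> Tuple[datetime, str]:
    """Split an archived-copy URI into (capture datetime in UTC, original URI)."""
    match = _WAYBACK.match(uri)
    if match is None:
        raise ValueError("not a timestamped Wayback Machine URI: " + uri)
    stamp, original = match.groups()
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, original
```

The Wikipedia `oldid` URI on the same slide carries no timestamp in the URI itself, so this approach does not generalize to it; that gap is part of what Memento addresses at the protocol level.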
  • Navigating Archived Resources
    • http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
      • archived resource for http://en.wikipedia.org/wiki/September_11_attacks
      • Dec 20 2001, 4:51:00 UTC
    • http://en.wikipedia.org/wiki/The_Pentagon
      • current Pentagon
  • Current and Past Web are Not Integrated
    • Current and Past Web based on same technology.
    • But, going from Current to Past Web is a matter of (manual) discovery.
    • Memento wants to make going from Current to Past Web a (HTTP) protocol matter.
    • Memento wants to integrate Current And Past Web.
  • Memento HTTP Flow
    • HEAD R, Accept-Datetime
      • response: Link → G
    • GET G, Accept-Datetime
      • response: 302 → M, Vary, TCN, Link → R,B,M
    • GET M, Accept-Datetime
      • response: 200, Content-Datetime, Link → R,B,M
  • One Memento HTTP Navigation
    • Scenario
    • cnn.com includes Link to TimeGate at Internet Archive
    • URI-R on one server, URI-G & URI-M on another
  • Memento HTTP Flow: URI-R (HEAD R, Accept-Datetime)
        HEAD http://cnn.com/ HTTP/1.1
        Host: cnn.com
        Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
        Connection: close
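The Accept-Datetime value in this request uses the standard HTTP date format. One way to produce it in Python is sketched below; the helper name `accept_datetime_headers` is hypothetical, and the function expects a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def accept_datetime_headers(when: datetime) -> Dict[str, str]:
    """Build the request headers shown on the slide for a desired datetime.

    `when` must be timezone-aware; it is converted to UTC and rendered
    in the "Tue, 11 Sep 2001 20:35:00 GMT" style.
    """
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp, "Connection": "close"}
```

For example, `accept_datetime_headers(datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc))` yields the same Accept-Datetime value as the request above.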
  • Memento HTTP Flow: Success – URI-R (Link → G)
        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:12 GMT
        Server: Apache
        Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
        Content-Length: 255
        Connection: close
        Content-Type: text/html; charset=iso-8859-1
  • Memento HTTP Flow: URI-G (GET G, Accept-Datetime)
        GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
        Host: web.archive.org
        Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
        Connection: close
  • Memento HTTP Flow: Success – URI-G (302 → M, Vary, Link → R,B,M)
        HTTP/1.1 302 Found
        Date: Thu, 21 Jan 2010 00:06:50 GMT
        Server: Apache
        TCN: choice
        Vary: negotiate, accept-datetime
        Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
        Link: <http://cnn.com/>; rel="original",
          <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
          <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
          <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
          <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
          <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
        Content-Length: 0
        Connection: close
        Content-Type: text/plain; charset=UTF-8
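A client needs to pull the individual relations out of the Link header in this response. The sketch below is a deliberately small parser, not a full implementation of the Link header grammar: it assumes every parameter is double-quoted with straight quotes and that each link has a single `rel` value. The name `parse_link_header` is made up for illustration. Note that a naive split on commas would break, because the `datetime` values themselves contain commas.

```python
import re
from typing import Dict

# One link: <uri> followed by any number of ; name="value" parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
_PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(value: str) -> Dict[str, Dict[str, str]]:
    """Map each rel value to {"uri": ..., plus any other quoted parameters}."""
    links: Dict[str, Dict[str, str]] = {}
    for uri, raw_params in _LINK.findall(value):
        params = dict(_PARAM.findall(raw_params))
        rel = params.pop("rel", None)
        if rel is not None:
            links[rel] = {"uri": uri, **params}
    return links
```

Applied to the header above, `parse_link_header(...)["original"]["uri"]` gives back `http://cnn.com/`, and the `first-memento` entry carries its `datetime` string alongside its URI.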
  • Memento HTTP Flow: URI-M (GET M, Accept-Datetime)
        GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
        Host: web.archive.org
        Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
        Connection: close
  • Memento HTTP Flow: Success – URI-M (200, Content-Datetime, Link → R,B,M)
        HTTP/1.1 200 OK
        Server: Apache-Coyote/1.1
        X-Archive-Orig-Accept-Ranges: bytes
        …
        Content-Type: text/html;charset=utf-8
        Content-Length: 23364
        Date: Thu, 21 Jan 2010 00:09:40 GMT
        Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
        Link: <http://cnn.com/>; rel="original",
          <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
          <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
          <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
          <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
          <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
        Connection: close
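The three request/response pairs walked through above (URI-R, then URI-G, then URI-M) can be strung together as follows. This is a sketch of the flow as the slides present it, not a reference client. The `fetch` callable is a hypothetical transport supplied by the caller, taking (method, URI, request headers) and returning (status code, response headers); it keeps the example independent of any particular HTTP library. Header names follow the slides; the later published Memento specification uses Memento-Datetime where these slides show Content-Datetime.

```python
import re
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]                    # (status code, headers)
Fetch = Callable[[str, str, Dict[str, str]], Response]   # (method, uri, headers)


def find_memento(uri_r: str, accept_datetime: str, fetch: Fetch) -> Tuple[str, str]:
    """Follow URI-R -> URI-G -> URI-M; return (URI-M, its Content-Datetime)."""
    headers = {"Accept-Datetime": accept_datetime}

    # Step 1: HEAD the original resource and read the TimeGate link it advertises.
    _, r_headers = fetch("HEAD", uri_r, headers)
    match = re.search(r'<([^>]*)>\s*;\s*rel="timegate"', r_headers.get("Link", ""))
    if match is None:
        raise LookupError("no timegate advertised for " + uri_r)
    uri_g = match.group(1)

    # Step 2: GET the TimeGate; a 302 points at the memento nearest the datetime.
    status, g_headers = fetch("GET", uri_g, headers)
    if status != 302 or "Location" not in g_headers:
        raise LookupError("TimeGate did not redirect to a memento")
    uri_m = g_headers["Location"]

    # Step 3: GET the memento itself and report when it was captured.
    status, m_headers = fetch("GET", uri_m, headers)
    if status != 200:
        raise LookupError("memento request failed with status %d" % status)
    return uri_m, m_headers.get("Content-Datetime", "")
```

Passing the transport in as a parameter also makes the flow easy to exercise against canned responses such as the ones on these slides, without touching the network.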
  • What does it all mean?
    • Cutting edge technology
    • Existing Infrastructure
    • Redefining Web surfing
    • MAJOR “real world” implications
  • Closing Thoughts
    • Preservation not for privileged priesthood
      • http://doi.acm.org/10.1145/1592761.1592794
      • http://booktwo.org/notebook/wikipedia-historiography/
    • no more hoary stories about format obsolescence:
      • http://blog.dshr.org/2010/09/reinforcing-my-point.html
    • Don't desiccate resources; leave them on the web
    • Endless metadata is not preservation…
    • archiving as branded service, not infrastructure
      • http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
  • Acknowledgements
    • Slides borrowed from:
    • Dr. Michael L. Nelson:
      • http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
      • http://www.slideshare.net/phonedude/review-of-web-archiving
      • http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
    • Martin Klein:
      • http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages