Analyzing the Persistence of Referenced Web Resources with Memento

Research into the extent to which web resources referenced anywhere in the text of publications held in scholarly repositories persist, either at their original URL or in web archives.

  1. Analyzing the Persistence of Referenced Web Resources with Memento
     Robert Sanderson, Mark Phillips, Herbert Van de Sompel
     http://mementoweb.org/
     Memento is funded by The Library of Congress.
     Persistence of Referenced Web Resources, Open Repositories 2011, Austin TX, June 6-11
  2. Overview
     • Motivating Horror Story
     • Memento
     • Experiment
     • Results
     • Conclusions/Future Work
  3. A Motivating Academic Horror Story
  4. A Motivating Academic Horror Story
  5. A Motivating Academic Horror Story
  6. A Motivating Academic Horror Story
  7. Another Motivating Academic Horror Story
  8. Another Motivating Academic Horror Story
  9. Question 1: To what extent are web resources that are referenced from works in repositories still available at their original URL?
     Significant prior art! But very small scale, other than Lawrence's early work on CiteSeer (see paper for references).
  10. Our Hero Enters the Scene!
  11. Question 1 (redux): To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?
      Prior art is sketchy at best, as it lacks an automated method to enable discovery of archived web resources.
  12. Memento Framework
      Memento wants to make it easy to navigate the web of the past.
      • Global version indicator: Time
      • Based on the primitives of the Web: resource, representation, content negotiation, link
      • Functionality: Given a URI and a Datetime, resolve the closest archived copy
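The "given a URI and a Datetime, resolve the closest archived copy" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Memento implementation: `Accept-Datetime` is the request header the Memento protocol uses for datetime negotiation, but the function names and the `archive.example.org` URIs are hypothetical, and the candidate copies are hard-coded rather than fetched from an archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the request header a client sends to ask for a past version."""
    # HTTP dates are expressed in GMT, so convert to UTC before formatting.
    utc_when = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_when, usegmt=True)}


def closest_memento(mementos, when):
    """Return the (datetime, uri) pair whose datetime is nearest to `when`."""
    return min(mementos, key=lambda m: abs(m[0] - when))


# Hypothetical archived copies of a single original resource.
copies = [
    (datetime(2009, 1, 15, tzinfo=timezone.utc), "http://archive.example.org/2009/page"),
    (datetime(2011, 5, 30, tzinfo=timezone.utc), "http://archive.example.org/2011/page"),
]
target = datetime(2011, 6, 6, tzinfo=timezone.utc)

headers = accept_datetime_header(target)
best = closest_memento(copies, target)
```

In a real deployment the selection would be made on the server side in response to the header; the local `closest_memento` only mirrors that logic for illustration.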
  13. Original Resources and Mementos
  14. Memento: Bridge from Present to Past
  15. Memento: Bridge from Present to Past
  16. Multiple Archives
  17. Original Resource’s Server Gone
  18. Question 2: How long is the period between the publication of a paper and the archiving of a resource cited by that paper?
      Memento allows us to answer this question.
  19. Experiment
      Using Memento, check all of the links extracted from papers in repositories to discover:
      • Are they still resolvable at their Original URI?
      • Are Mementos available in archives?
      • What is the Memento-Datetime of the closest copy?
      Data Set:
      • University of North Texas Institutional Repository
        • 3595 works, 17965 unique URLs
        • May 1999 to August 2010
      • arXiv
        • 400144 works, 144087 unique URLs
        • December 1993 to December 2009
      • Total:
        • 162052 URLs, generating 306452 (URL, Paper) tuples
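The first two questions on this slide amount to sorting every URL into one of four buckets. The Python sketch below shows one way to tally such outcomes; the bucket names, the function names, and the toy observations are assumptions made for illustration and are not taken from the study. The one check on real figures is the addition of the two per-repository URL counts reported on the slide.

```python
from collections import Counter


def classify(live: bool, archived: bool) -> str:
    """Name the bucket for one URL given its two observed properties."""
    if live and archived:
        return "live and archived"
    if live:
        return "live only"
    if archived:
        return "archived only"
    return "lost"


def summarize(observations):
    """Return the percentage of URLs in each bucket."""
    counts = Counter(classify(live, archived) for live, archived in observations)
    total = sum(counts.values())
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}


# The per-repository URL counts on the slide add up to the reported total.
assert 17965 + 144087 == 162052

# Toy (live, archived) observations, one per URL; not real data.
toy = [(True, True), (True, False), (False, True), (False, False)]
shares = summarize(toy)
```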
  20. Experimental Process
      [Flow diagram]
      • Links: Extract Links → Filter Links* → Normalize Links
      • Metadata: Extract Metadata → Normalize Metadata
      • Combined: (URL, Paper, Time, Subject)
      • Results: (URL, Time, Memento-Time, Paper, Subject)
      * We filter broken and intra/inter-repository links.
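The link branch of this process (extract, filter, normalize) might look roughly like the Python sketch below. It is a guess at the shape of the pipeline, not the authors' code: the regular expression, the normalization rules (lowercase host, drop fragment), the reading of "broken" as "unparseable or without a host", and all function names are assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Rough pattern for http(s) URLs embedded in running text (an assumption).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text):
    """Find candidate URLs and trim trailing sentence punctuation."""
    return [match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)]


def keep(url, repository_hosts):
    """Drop malformed URLs and links that point back into a repository."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    return host is not None and host not in repository_hosts


def normalize(url):
    """Lowercase the host, drop the fragment, and default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, ""))


def process(text, repository_hosts):
    """Run extract, filter, normalize and return unique URLs in order."""
    seen, result = set(), []
    for raw in extract_links(text):
        if not keep(raw, repository_hosts):
            continue
        url = normalize(raw)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample = "See http://Example.org/a#frag, http://example.org/a and http://arxiv.org/abs/1105.3459."
links = process(sample, {"arxiv.org"})
```

Normalizing before deduplication matters here: the two `example.org` spellings in the sample collapse into a single URL, which is what a count of unique URLs needs.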
  21. Results: Archiving Extent per Repository
      UNT
      • 72% in archives and/or still exist
      • High proportion of archived URLs, possibly due to academic level and general disciplines
      arXiv
      • 78% in archives and/or still exist
      • 45% still exist, but not archived! Possibly due to high value, but very discipline specific references
  22. Results: Days between Publication and Archive
      Typical long tail, but inexplicably similar curves at different scales for repositories.
      • arXiv: 45% within a month, 80% within a year
      • UNT: 48% within a month, 80% within a year
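Figures such as "within a month" and "within a year" are cumulative shares of the publication-to-archive delay. The Python sketch below shows the arithmetic on toy dates. It rests on assumptions not stated in the slides: a month is taken as 30 days and a year as 365, a copy archived before publication counts as within any limit, and the dates themselves are invented.

```python
from datetime import date


def delay_days(published: date, archived: date) -> int:
    """Days from publication of the paper to the archived copy."""
    return (archived - published).days


def share_within(delays, limit_days):
    """Fraction of delays that are at most `limit_days`."""
    if not delays:
        return 0.0
    return sum(1 for d in delays if d <= limit_days) / len(delays)


# Toy (publication date, Memento-Datetime) pairs; not real data.
pairs = [
    (date(2005, 1, 1), date(2005, 1, 6)),
    (date(2005, 1, 1), date(2005, 1, 21)),
    (date(2005, 1, 1), date(2005, 6, 1)),
    (date(2005, 1, 1), date(2007, 1, 1)),
]
delays = [delay_days(published, archived) for published, archived in pairs]

within_month = share_within(delays, 30)
within_year = share_within(delays, 365)
```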
  23. Results: Archiving Extent Per Discipline
      UNT
      • Most disciplines exhibit similar behavior, except History, Journalism and English, which have a lower percentage archived
      arXiv
      • Most disciplines exhibit similar behavior, with a very low percentage archived within one month and a very high percentage still dereferenceable
  24. Conclusions
      Biggest Issues:
      • Need access to the URIs extracted from repository resources
      • Need a web archive of scholarly communications context
      • WebCite is good, but requires a proactive archiving request
      Proposal:
      • Repositories should expose the links extracted from the full text of their resources
        • In metadata for the resource
        • In an Atom feed …
      • To act as seed URL list for a (Memento compliant) web archive
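One possible shape for the Atom feed in this proposal is sketched below in Python. The slides do not define a feed layout, so everything here is an assumption: each repository work becomes an entry, each extracted URL becomes a `link` element with `rel="related"`, and the `repo.example.edu` identifiers are hypothetical. The feed is minimal and has not been checked against a validator.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def q(tag: str) -> str:
    """Qualify a tag name with the Atom namespace."""
    return f"{{{ATOM}}}{tag}"


def build_feed(feed_id, title, updated, papers):
    """Serialize repository works and their extracted links as an Atom feed."""
    feed = ET.Element(q("feed"))
    ET.SubElement(feed, q("id")).text = feed_id
    ET.SubElement(feed, q("title")).text = title
    ET.SubElement(feed, q("updated")).text = updated
    for paper in papers:
        entry = ET.SubElement(feed, q("entry"))
        ET.SubElement(entry, q("id")).text = paper["id"]
        ET.SubElement(entry, q("title")).text = paper["title"]
        ET.SubElement(entry, q("updated")).text = paper["updated"]
        # One link element per URL extracted from the full text.
        for url in paper["links"]:
            ET.SubElement(entry, q("link"), rel="related", href=url)
    return ET.tostring(feed, encoding="unicode")


# Hypothetical repository record with two extracted links.
papers = [{
    "id": "http://repo.example.edu/works/1",
    "title": "An Example Work",
    "updated": "2011-06-06T00:00:00Z",
    "links": ["http://example.org/a", "http://example.org/b"],
}]
feed_xml = build_feed("http://repo.example.edu/feed", "Extracted links", "2011-06-06T00:00:00Z", papers)
```

A crawler for a web archive could then read the `href` values from such a feed as its seed URL list.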
  25. Future Work
      • Repeat with much larger dataset
        • JSTOR
        • CiteSeer
        • Astrophysics Data System
        • RePEc
        • PubMed
        • arXiv
        • 10+ ETD Repositories
        • SSRN (discussion ongoing)
        • Your repository?
      • Investigate 45/80 similarity
      • Community support for automated scholarly web archive project
  26. Thank You!
      • Rob Sanderson
        • Twitter: @azaroth42
        • Email: azaroth42@gmail.com or rsanderson@lanl.gov
      • Paper: http://arxiv.org/abs/1105.3459
      • Slides: http://slidesha.re/
      • Memento:
        • http://www.mementoweb.org/
        • http://groups.google.com/group/memento-dev
