Link Persistence, Website Persistence

Presentation on the discrepancy between measurements of link persistence and website persistence and why it matters.

  • We’ve been losing the web for as long as it’s existed; the first webpage, created by Tim Berners-Lee, exists only as a copy recreated a year after the original (http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html).
  • Mainstream recognition of the once-esoteric “page not found” http response code reflects the popular perception that web content is ephemeral.
  • I started looking into the literature on link persistence in preparation for writing a blog post for the Library of Congress’ digital preservation blog, the Signal. Brewster Kahle, founder of the Internet Archive, has offered various numbers for the average lifespan of a webpage over the years. As someone trying to archive the entire public web, he seemed like someone who would know.
  • A meta-study of 17 other studies of link persistence found that links decay, but at widely varying rates: 39-82% over periods of 1-13 years.
  • The ambiguity about the ephemerality of web content is reminiscent of Rothenberg’s famous quote about the persistence of digital documents in general.
  • Now let’s take a look at the simplest automated approach to checking the persistence of links.
  • When the client’s browser requests the resource at a particular URL, the web server first sends an http response code, indicating the disposition of the resource at the requested URL. These are some common response codes.
  • An automated link checker, also known as a “spider” or “robot”, works by requesting a series of links and recording the response codes. (Image: https://secure.flickr.com/photos/chidorian/3461667159/)
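The core of such a link checker can be sketched in a few lines. This is a minimal illustration of the response-code-logging approach, not the study’s actual tooling (the study used Heritrix); the 200-only definition of a “working” link follows the criterion used later in the results.

```python
import urllib.error
import urllib.request

def check_link(url, timeout=10):
    """Request a URL and return the HTTP status code it ultimately returns.
    urllib follows redirects automatically, so the code reflects the final
    disposition of the resource."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status code
    except urllib.error.URLError:
        return None  # DNS failure, timeout, etc.: no response code at all

def is_working(status_code):
    """A link 'works' if it ultimately returns a 200 response code."""
    return status_code == 200
```

Running `check_link` over a prepared list of URLs and tallying `is_working` reproduces the first, automated stage of the study.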
  • Response codes are limited, however: they can tell us about the disposition of content at the specified URL, but not what that content is.
  • Considering a link and a corresponding website over time, there are a number of possible scenarios when we go back to check on the persistence of both.
  • The most straightforward case is where the link, the website, and their correspondence all persist.
  • Sometimes, however, the link still works, but it points to a different website.
  • That website may still exist at another URL.
  • Alternatively, maybe the link doesn’t work.
  • But the website that the link previously corresponded to may still exist at another URL.
  • Lastly, sometimes the link doesn’t work and the website no longer exists.
  • These examples illustrate that link persistence and website persistence are two different things and that using the former as a proxy for the latter misses some of the possible scenarios.
  • Considering those scenarios, conflating link persistence with website persistence will result in systematic mis-measurements of website persistence. How significant are these mis-measurements?
  • Measuring website persistence requires knowing about the state of websites in the past, a perfect use case for web archives. I decided to do a study based on the web archives I was most familiar with, those of the Library of Congress (http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html).
  • The U.S. Election 2002 Web Archive is one of their earliest web archive collections. The Library of Congress has archived U.S. national election websites every two years since 2000.
  • There were many more links in the collection than were used in this study. Links corresponding to electoral candidate websites were excluded because they were universally short-lived and would skew the results. The remaining 1,071 links pointed to state government, political party, advocacy organization, major newspaper, and political blog websites.
  • The study consisted of two stages. First, we ran Heritrix against the prepared list of links and logged http response codes and redirects.
  • In the second stage, we manually visited each link and noted whether it was the same website as we had previously archived. If it was a different website or if the link didn’t work, we attempted to locate the new location of the previously-archived website using a search engine.
  • The link checker found that 91% of the links ultimately returned a “200” response code. The remaining 9% ultimately returned either “4xx” or “5xx” series response codes.
  • Factoring in whether the working links still corresponded to the same websites, the share of all links that both work and point to the same site drops to 83%; 8% of all links are working links that point to different sites.
  • Looking more closely at the non-working links, roughly 77% of the previously archived websites still exist, even though the previously archived links no longer point to them.
  • In aggregate, the percentage of websites that still exist after 10 years (94%) is 3 percentage points higher than link checking alone would have suggested (91%). This isn’t at all to say that web archiving isn’t important: had the candidate websites been included, the chart would show that fewer than half of the websites still existed. And even where a site persists, as the White House website has these last ten years, specific content on it has invariably disappeared.
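The aggregate figure can be reproduced from the reported percentages. This is illustrative arithmetic only, using the roughly 77% survival rate observed for sites behind non-working links and the 48% rate for sites behind redirected links from the results summary; all figures are percentages of the full link set.

```python
# All figures are percentages of the full set of studied links.
working_same_site = 83.0   # link works and points to the same website
working_diff_site = 8.0    # link works but points to a different website
non_working = 9.0          # link ultimately returns a 4xx/5xx response code

# ~48% of the sites behind redirected links and ~77% of the sites behind
# non-working links were found to still exist at other URLs.
still_exists = (working_same_site
                + 0.48 * working_diff_site
                + 0.77 * non_working)
print(round(still_exists))  # -> 94
```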
  • The results suggest that we may be marginally overestimating website persistence by conflating working links with website persistence but greatly underestimating website persistence by conflating non-working links with websites that have disappeared.
  • The key caveat for these results is that I excluded from the study over 1,000 URLs in the web archive collection that would all likely have been both non-working links and websites that no longer existed. The remaining URLs were those for which persistence or disappearance seemed a genuinely open question.
  • We’re able to effectively perform link checking with current technologies. Can we come up with a better approach to checking the persistence of websites? Better understanding website persistence would facilitate better capacity planning (e.g., by reducing storage requirements for near-duplicate resources), inform capture frequency scheduling, and increase confidence that captured links corresponded to desired websites.
  • A website checker would need to be able to check links, too, but that functionality is already covered. What are the prospects for tools that could check link and website correspondence and check whether a website still exists?
  • In theory, these two latter tasks aren’t that difficult; it’s just that they need to be automated in order to be scalable.
  • Let’s look first at possible tools for checking link and website correspondence.
  • Heritrix already has the ability to compare the checksums of a resource at a particular URL over successive visits. This allows for an “absolute” assessment of sameness.
  • However, even the smallest change is enough to produce a checksum mis-match. We need a tool that can assess the magnitude or importance of the difference between successive versions, not just the fact of a difference.
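The brittleness of exact checksums is easy to demonstrate. The two captures below are invented for illustration; Heritrix itself records a content digest (SHA-1 by default) for each captured resource.

```python
import hashlib

def content_digest(content: bytes) -> str:
    """Exact-match fingerprint of a captured resource."""
    return hashlib.sha1(content).hexdigest()

# Two hypothetical captures of the same page, one character apart.
capture_2002 = b"<html><body>Welcome to our website.</body></html>"
capture_2013 = b"<html><body>Welcome to our website!</body></html>"

unchanged = content_digest(capture_2002) == content_digest(capture_2013)
print(unchanged)  # False: any change at all defeats an exact checksum
```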
  • The Vi-DIFF algorithm evaluates both the structure of a webpage and its segmented visual appearance to assess the magnitude of change. As a follow-on to a link checker, the algorithm could be calibrated to indicate whether it was the same site as previously visited or an entirely new one.
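Vi-DIFF itself segments the rendered page and compares structure and visual blocks. As a much cruder stand-in, a sequence-similarity ratio over the raw markup at least yields a magnitude of change that can be thresholded; the 0.5 cutoff here is an arbitrary illustration of the calibration idea, not a value from the Vi-DIFF work.

```python
import difflib

def change_magnitude(old_html: str, new_html: str) -> float:
    """Fraction of change between two captures: 0.0 (identical) to 1.0."""
    return 1.0 - difflib.SequenceMatcher(None, old_html, new_html).ratio()

def same_site(old_html: str, new_html: str, threshold: float = 0.5) -> bool:
    """Calibrated judgment: small changes suggest the same site,
    wholesale replacement suggests a different one."""
    return change_magnitude(old_html, new_html) < threshold
```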
  • Now let’s look at possible tools for checking website persistence, irrespective of link persistence.
  • The lexical signature is a set of keywords sufficiently descriptive and unique that it can be used as a search-engine query to rediscover the page.
  • If the URL no longer works but exists in an archive, the lexical signature can be derived from the archived page and used to locate the new URL.
  • If the URL itself isn’t archived, the lexical signature can be derived from backlinks.
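A toy version of deriving a lexical signature is shown below. Published approaches, like the one Ware, Klein, and Nelson evaluate, weight terms by TF-IDF against a reference corpus; plain term frequency and the small stopword list here are stand-in assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are"}

def lexical_signature(page_text: str, n_terms: int = 5) -> list[str]:
    """Return the most frequent distinctive terms of a page, usable as a
    search-engine query to rediscover the page at a new URL."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n_terms)]
```

Applied to an archived copy of a page (or, failing that, to the pages that link to it), the resulting terms become the query used to locate the site’s new URL.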
  • These tools exist but are not yet in wide use in the web archiving community. Wider utilization of these tools would allow us to better assess website persistence and the discrepancy with link persistence.

    1. Link Persistence, Website Persistence. Nicholas Taylor (@nullhandle), May 28, 2013. “Forward” by Flickr user Hitchster under CC BY 2.0
    2. why preserve the web?
    3. broken links. “404” by Flickr user adactio under CC BY 2.0
    4. 44 days (Kahle, 1997)
    5. 75 days (Kahle, 2001)
    6. 100 days (Kahle, 2003)
    7. variable (Sanderson, Phillips, and Van de Sompel, 2011) • literature review of 17 studies • research focused on scholarly citations • decay rates of 39-82% • over periods of 1-13 years
    8. “Digital documents last forever—or five years, whichever comes first.” (Jeff Rothenberg, 1997) “Out of books sprout... plants” by DeviantArt user quinn.anya under CC BY-SA 2.0
    9. The Art and Science of Link Checking. “http Blue Background” by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0
    10. http response codes • 404: “Not Found” • 200: “OK” • 301: “Moved Permanently” • 500: “Internal Server Error”
    11. automated link checker. “La Machine @ Yokohama” by Flickr user chidorian under CC BY-SA 2.0
    12. what link checking tells us. “200 ok” by Flickr user reidab under CC BY-NC-SA 2.0
    13. possible scenarios • link works; same website • link works; different website (website may or may not still exist) • link doesn’t work; website still exists • link doesn’t work; website no longer exists
    14. link works; same website. http://www.fair.org/ (2002); http://www.fair.org/ (2013)
    15. link works; different website… http://www.fb.com/ (2002); http://www.fb.com/ (2013)
    16. …but website still exists. http://www.fb.org/ (2013)
    17. link doesn’t work… http://www.state.mo.us/ (2002); http://www.state.mo.us/ (2013)
    18. …but website still exists. http://www.sos.mo.gov/ (2013)
    19. link doesn’t work; website no longer exists
    20. assumptions • link works; same website • link works; different website (website may or may not still exist) • link doesn’t work; website still exists • link doesn’t work; website no longer exists
    21. research questions • how much are we overestimating website persistence? (some working links point to different websites) • how much are we underestimating website persistence? (websites may still exist even though links don’t work, or do work but point to different websites)
    22. A Study Using Web Archives
    23. Library of Congress U.S. Election 2002 Web Archive
    24. preparing the list of links • exclude links corresponding to electoral candidate websites • 1,071 links: state government, political parties, advocacy organizations, major newspapers, political blogs
    25. methodology • automated: run Heritrix against links, ignoring robots.txt; log http response codes; log redirects • manual: manually check each link; same website behind working link?; does website still exist?
    26. methodology (repeat of slide 25)
    27. working link? 91% working; 9% non-working
    28. same website? 83% working link, same site; 9% non-working link; 8% working link, different site
    29. non-working link; website still exists? 91% working; 8% working, different site; 7% still exists; 2% doesn’t exist
    30. website still exists? 94% still exists; 6% doesn’t exist
    31. summary of results • how much are we overestimating website persistence? 8% of working links point to different websites • how much are we underestimating website persistence? 82% of websites associated with non-working links still exist; 48% of websites whose links now point to different websites still exist
    32. what does it mean? • websites are (much more) persistent than links • websites are surprisingly durable? “Golden Spider Silk” by Flickr user amandabhslater under CC BY-SA 2.0
    33. Beyond Link Checking: Website Checking? “Check” by Flickr user ex.libris under CC BY-NC-ND 2.0
    34. building a website checker: 1. check whether link still works; 2. check whether link still corresponds to website; 3. check whether website still exists
    35. “Most web archiving problems are problems of scale.” (Kris Carpenter Negulescu, 2012) “chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0
    36. building a website checker (repeat of slide 34)
    37. Heritrix compares checksums. “Fingerprint” by Flickr user CPOA under CC BY-ND 2.0
    38. …but checksums are limited. “Hashing Emily” by Flickr user wlef70 under CC BY-NC-SA 3.0
    39. visual analysis of page changes. Pehlivan, Ben-Saad, and Gançarski: “Vi-DIFF: Understanding Web Pages Changes”
    40. building a website checker (repeat of slide 34)
    41. lexical signature of archived page. Ware, Klein, and Nelson: “An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages”
    42. find archived pages w/ Memento • http protocol enhancement • enables discovery of archived resources in distributed web archives
    43. lexical signatures of backlink pages
    44. “The future is already here; it’s just not very evenly distributed.” (William Gibson, 1999) “Time Travel” by Flickr user xcalibr under CC BY-NC-ND 2.0
    45. Nicholas Taylor (@nullhandle). “Thank You” by Flickr user muffintinmom under CC BY 2.0
