Using the Memento Framework to Assess Content Drift in Scholarly Communication

Presentation at IIPC Web Archiving Conference 2017

  1. Memento to Assess Content Drift in Scholarly Communication, @mart1nkle1n, IIPC WAC, 06/16/2017, London, UK. Using the Memento Framework to Assess Content Drift in Scholarly Communication. Martin Klein (@mart1nkle1n), Herbert Van de Sompel (@hvdsomp), Research Library, Los Alamos National Laboratory. Acknowledgements: Shawn Jones, Harihar Shankar (LANL); Richard Tobin, Claire Grover (University of Edinburgh); Andy Jackson (British Library)
  2. Link Rot
  3. (no text on slide)
  4. Content Drift
  5. http://dl00.org (2000)
  6. http://dl00.org (2004)
  7. http://dl00.org (2005)
  8. http://dl00.org (2008)
  9. Content Drift (in legal documents)
  10. (no text on slide)
  11. Content Drift (in scholarly articles)
  12. Referenced in http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110, published on August 15th 2009. Snapshots shown: May 8th 2009 and August 27th 2009
  13. Referenced in http://arxiv.org/abs/astro-ph/9707064, published on July 4th 1997. Snapshots shown: June 7th 1997 and today
  14. ArXiv Corpus [chart: number of articles and URI references per year, 1997 to 2011]
  15. http://hiberlink.org/ Definition: • Link Rot + Content Drift = Reference Rot Observation: • Web at large resources are referenced in scholarly articles • Links to these resources are subject to Reference Rot Problem: • Threat to the integrity of the web-based scholarly record • Resources do not have the same sense of fixity as, e.g., journal articles • Resources’ custodianship is different in terms of long-term archiving, integrity, and access
  16. http://dx.doi.org/10.1371/journal.pone.0115253
  17. Focus: Content Drift
  18. http://dx.doi.org/10.1371/journal.pone.0167475
  19. Study Dataset • 3.5 million articles from arXiv, Elsevier, PMC • Published between Jan 1997 – Dec 2012 • Converted from PDF to XML • Extraction of URIs to web at large resources (>1 million) • Keep track of articles’ publication date
  20. Novel Approach to Assess Content Drift
  21. Step 1: Find Mementos • ~1 million URI references • ~650k Memento Pre/Post pairs discovered via Memento (https://mementoweb.org, https://tools.ietf.org/html/rfc7089) [timeline figure: t-1, t, t+1]
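
Step 1 pairs each URI reference with a Memento captured before and a Memento captured after a point in time t. The slides do not show how the pair is chosen, so the following is only a minimal sketch. It assumes the capture datetimes have already been read from a Memento TimeMap (RFC 7089), that t is the referencing article's publication date, and that a capture exactly at t counts as neither Pre nor Post. The function name `pick_pre_post` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime
from typing import Optional, Sequence, Tuple


def pick_pre_post(
    memento_datetimes: Sequence[datetime], t: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the (Pre, Post) capture datetimes that surround t.

    Pre is the latest capture strictly before t, Post is the earliest
    capture strictly after t. Either one is None if no such capture exists.
    Assumption: captures exactly at t are skipped.
    """
    ordered = sorted(memento_datetimes)
    i = bisect_left(ordered, t)   # index of the first capture >= t
    j = bisect_right(ordered, t)  # index of the first capture > t
    pre = ordered[i - 1] if i > 0 else None
    post = ordered[j] if j < len(ordered) else None
    return pre, post


if __name__ == "__main__":
    # Dates from slide 12, plus one hypothetical later capture.
    captures = [
        datetime(2009, 5, 8),
        datetime(2009, 8, 27),
        datetime(2010, 1, 1),  # hypothetical
    ]
    print(pick_pre_post(captures, datetime(2009, 8, 15)))
```

A URI reference only yields a usable pair when both values are present, which is one plausible reason why ~1 million references produce fewer Pre/Post pairs.
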
  22. Step 2: Select Representative Mementos
  23. Step 2: Select Representative Mementos • Apply content similarity measures • How similar is representative?
  24. Content Similarity Measures • Compute normalized scores (values between 0 and 100) for: • Simhash • Jaccard • Sørensen-Dice • Cosine
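
The four measures named on the slide can be illustrated with a short sketch. Everything beyond the measure names and the 0 to 100 range is an assumption: the slides do not say how text is tokenized, how Simhash fingerprints are built, or how each score is normalized. Here, text is split into lowercase word tokens, Jaccard and Sørensen-Dice work on token sets, Cosine works on term-frequency vectors, and Simhash uses a 64-bit fingerprint whose Hamming distance is mapped linearly onto 0 to 100.

```python
import hashlib
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0  # two empty documents are treated as identical
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: str, b: str) -> float:
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


def simhash(text: str) -> int:
    """64-bit Simhash fingerprint built from MD5-hashed tokens (assumed variant)."""
    weights = [0] * 64
    for tok, count in Counter(tokens(text)).items():
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def simhash_similarity(a: str, b: str) -> float:
    """Map the Hamming distance between fingerprints onto 0..100."""
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 100.0 * (1 - distance / 64)


if __name__ == "__main__":
    pre = "the quick brown fox"
    post = "the quick red fox"
    for measure in (simhash_similarity, jaccard, sorensen_dice, cosine):
        print(measure.__name__, measure(pre, post))
```

Using several measures side by side guards against the blind spots of any single one: set-based scores ignore term frequency, while Simhash reacts to the overall token distribution.
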
  25. Representative Mementos • Idea • If perfect score in all 4 similarity measures → Memento Pre and Post are the same → Representative Mementos • Sanity check needed • Via HTTP headers: ETag and Last-Modified • If same for Pre and Post Memento → HTTP-same • Sanity check passed! • 98.88% of Memento pairs that are HTTP-same have perfect score in all 4 similarity measures
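
The two checks on this slide can be written down as simple predicates. This is a sketch under stated assumptions, not the authors' code: scores are assumed to arrive as a mapping keyed by the hypothetical names in `MEASURES`, the headers are assumed to be the original resource's response headers as recorded for each Memento, and a pair is treated as HTTP-same only if both ETag and Last-Modified are present and equal. The slide does not say whether one matching header would suffice.

```python
from typing import Mapping

# Hypothetical keys for the four similarity scores (each in 0..100).
MEASURES = ("simhash", "jaccard", "sorensen_dice", "cosine")


def is_representative(scores: Mapping[str, float]) -> bool:
    """True if Pre and Post reach a perfect score in all four measures."""
    return all(scores.get(name) == 100 for name in MEASURES)


def is_http_same(pre_headers: Mapping[str, str], post_headers: Mapping[str, str]) -> bool:
    """True if ETag and Last-Modified are both present and identical.

    Header names are compared case-insensitively. Requiring both headers
    is an assumption made for this sketch.
    """
    pre = {k.lower(): v for k, v in pre_headers.items()}
    post = {k.lower(): v for k, v in post_headers.items()}
    for name in ("etag", "last-modified"):
        if name not in pre or pre[name] != post.get(name):
            return False
    return True


if __name__ == "__main__":
    headers = {"ETag": '"abc123"', "Last-Modified": "Fri, 08 May 2009 10:00:00 GMT"}
    perfect = {name: 100.0 for name in MEASURES}
    print(is_http_same(headers, dict(headers)), is_representative(perfect))
```

The sanity check on the slide then amounts to counting, among all pairs where `is_http_same` holds, how many also satisfy `is_representative`.
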
  26. Step 2: Select Representative Mementos • ~313k referenced URIs have representative Mementos
  27. Representative Mementos in arXiv
  28. arXiv, Elsevier, PMC
  29. Step 3: Dereference Live Web Version of URI • 241k out of 313k URIs have a live web version
  30. Step 4: Representative Memento vs. Live Version • Apply content similarity measures • Bin results into 6 clusters
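
Binning similarity scores into six clusters could look like the sketch below. The slide gives only the number of clusters, so the bin edges here are purely hypothetical: one bin for a perfect score of 100 and five equal-width bins of 20 points each below it. The names `BIN_LABELS`, `bin_label`, and `bin_counts` are likewise made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical bin layout: lowest range first, perfect score last.
BIN_LABELS = ("[0, 20)", "[20, 40)", "[40, 60)", "[60, 80)", "[80, 100)", "100")


def bin_label(score: float) -> str:
    """Map a normalized similarity score (0..100) to one of six bins."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score == 100:
        return BIN_LABELS[5]
    return BIN_LABELS[int(score // 20)]


def bin_counts(scores: Iterable[float]) -> Dict[str, int]:
    """Count how many scores fall into each bin, including empty bins."""
    counts = Counter(bin_label(s) for s in scores)
    return {label: counts.get(label, 0) for label in BIN_LABELS}


if __name__ == "__main__":
    # Made-up scores for a handful of Memento vs. live-version comparisons.
    print(bin_counts([100, 99.9, 80, 79.9, 20, 0]))
```

Keeping the perfect score in its own bin separates URIs whose live content is unchanged from those that have drifted by any amount.
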
  31. (no text on slide)
  32. Aggregate Similarity Score • Good: 23.7% of URIs have *not* drifted! • Bad: 3/4 of URIs *have* drifted!
  33. Content Drift & Link Rot Over Time - arXiv
  34. arXiv, Elsevier, PMC
  35. Take-Aways 1. Scholarly articles increasingly contain URI references to web at large resources. 2. Such resources are subject to reference rot (link rot + content drift). 3. Custodians of these resources are typically not overly concerned with archiving of their content and longevity of the scholarly record. 4. Spoiler: Authors, publishers, web archives, and other parties can help tackle this problem (see my lightning talk + poster on Robust Links).
  36. Using the Memento Framework to Assess Content Drift in Scholarly Communication. Martin Klein (@mart1nkle1n), Herbert Van de Sompel (@hvdsomp), Research Library, Los Alamos National Laboratory