Memento to Assess Content Drift in Scholarly Communication
@mart1nkle1n
IIPC WAC, 06/16/2017, London, UK
Using the Memento Framework
to Assess Content Drift
in Scholarly Communication
Acknowledgements:
Shawn Jones, Harihar Shankar (LANL)
Richard Tobin, Claire Grover (University of Edinburgh)
Andy Jackson (British Library)
Martin Klein
@mart1nkle1n
Herbert Van de Sompel
@hvdsomp
Research Library
Los Alamos National Laboratory
Link Rot
Content Drift
[Screenshots: http://dl00.org as it appeared in 2000, 2004, 2005, and 2008]
Content Drift
(in legal documents)
Content Drift
(in scholarly articles)
Referenced in
http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110
published on August 15th 2009
[Screenshots: captures from May 8th 2009 and August 27th 2009]
Referenced in
http://arxiv.org/abs/astro-ph/9707064
published on July 4th 1997
[Screenshots: capture from June 7th 1997 and the page as of today]
[Figure: arXiv corpus, 1997 to 2011: number of articles and of URI references per year]
http://hiberlink.org/
Definition:
• Link Rot + Content Drift = Reference Rot
Observation:
• Scholarly articles reference web at large resources
• Links to these resources are subject to Reference Rot
Problem:
• Threat to the integrity of the web-based scholarly record
• These resources do not have the same sense of fixity as, e.g., journal articles
• Resources’ custodianship is different in terms of long-term archiving, integrity, and access
http://dx.doi.org/10.1371/journal.pone.0115253
Focus: Content Drift
http://dx.doi.org/10.1371/journal.pone.0167475
Study Dataset
• 3.5 million articles from arXiv, Elsevier, PMC
• Published between Jan 1997 and Dec 2012
• Converted from PDF to XML
• Extraction of URIs to web at large resources (>1 million)
• Keep track of articles’ publication date
Novel Approach to Assess Content Drift
Step 1: Find Mementos
• ~1 million URI references
• ~650k Memento Pre/Post pairs discovered via Memento
  https://mementoweb.org
  https://tools.ietf.org/html/rfc7089
[Figure: timeline with points t-1, t, and t+1]
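As a rough illustration of Step 1, the sketch below picks a Memento Pre/Post pair around a reference date from a list of capture datetimes. It assumes the datetimes have already been obtained, for example from a Memento TimeMap (RFC 7089). The function name `pre_post_pair`, the strict before/after comparison, and the third capture date are illustrative assumptions, not details from the study. The first two capture dates and the publication date are those of the nuclphysa example shown earlier in the deck.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple


def pre_post_pair(
    memento_datetimes: Sequence[datetime],
    reference_date: datetime,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the closest captures before and after reference_date.

    Assumption: a capture taken exactly at reference_date counts as
    neither Pre nor Post. Returns None when no surrounding pair exists.
    """
    before = [d for d in memento_datetimes if d < reference_date]
    after = [d for d in memento_datetimes if d > reference_date]
    if not before or not after:
        return None
    return max(before), min(after)


# Capture dates: the first two are from the slide, the third is made up.
captures = [datetime(2009, 5, 8), datetime(2009, 8, 27), datetime(2010, 1, 3)]
publication_date = datetime(2009, 8, 15)
pair = pre_post_pair(captures, publication_date)
```

A URI with captures on only one side of the publication date yields no pair, which is one plausible reason why fewer pairs than URI references are found.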
Step 2: Select Representative Mementos
• Apply content similarity measures
• How similar must a Memento pair be to count as representative?
Content Similarity Measures
• Compute normalized scores (values between 0 and 100) for:
  • Simhash
  • Jaccard
  • Sørensen-Dice
  • Cosine
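A minimal sketch of the four measures, each scaled to the 0 to 100 range. Whitespace tokenization, the 64-bit Simhash built on BLAKE2b token hashes, and the scaling are assumptions made for illustration; the slides do not say how text was extracted or how the scores were normalized.

```python
import hashlib
import math
from collections import Counter


def tokens(text):
    # Assumption: lowercase whitespace tokens stand in for real text extraction.
    return text.lower().split()


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    if not ca or not cb:
        return 0.0
    dot = sum(count * cb[tok] for tok, count in ca.items())
    norm_sq = sum(c * c for c in ca.values()) * sum(c * c for c in cb.values())
    return min(100.0, 100.0 * dot / math.sqrt(norm_sq))


def simhash(text, bits=64):
    # Each token votes +count or -count on every bit of the fingerprint.
    weights = [0] * bits
    for tok, count in Counter(tokens(text)).items():
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def simhash_similarity(a, b, bits=64):
    # Share of fingerprint bits that agree, scaled to 0..100.
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 100.0 * (bits - distance) / bits
```

All four functions return exactly 100.0 for two identical texts, which matters when "perfect score" is used as a selection criterion.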
Representative Mementos
• Idea
  • If perfect score in all 4 similarity measures
    → Memento Pre and Post are the same
    → Representative Mementos
• Sanity check needed
  • Via HTTP headers: ETag and Last-Modified
  • If same for Pre and Post Memento
    → HTTP-same
• Sanity check passed!
  • 98.88% of Memento pairs that are HTTP-same have a perfect score in all 4 similarity measures
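The selection rule and the sanity check can be sketched as two small predicates. The score dictionary keys, the function names, and the choice to require that both headers be present and equal are assumptions; the slide only says the headers are compared for the Pre and Post Memento.

```python
from typing import Mapping

# Hypothetical keys for the four similarity scores (each 0..100).
MEASURES = ("simhash", "jaccard", "sorensen_dice", "cosine")


def is_representative(scores: Mapping[str, float]) -> bool:
    """True when the Pre/Post pair scores 100 on all four measures."""
    return all(scores.get(m) == 100 for m in MEASURES)


def _normalized(headers: Mapping[str, str]) -> dict:
    # HTTP header names are case-insensitive, so compare on lowercase names.
    return {name.lower(): value.strip() for name, value in headers.items()}


def http_same(pre_headers: Mapping[str, str], post_headers: Mapping[str, str]) -> bool:
    """True when ETag and Last-Modified are present and equal in both responses.

    Assumption: both headers are required; a missing header means not HTTP-same.
    """
    pre, post = _normalized(pre_headers), _normalized(post_headers)
    for name in ("etag", "last-modified"):
        if name not in pre or pre.get(name) != post.get(name):
            return False
    return True
```

Checking how often `http_same` pairs also satisfy `is_representative` is the kind of cross-check the 98.88% figure reports.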
• ~313k referenced URIs have representative Mementos
Representative Mementos in arXiv
[Figure panels: arXiv, Elsevier, PMC]
Step 3: Dereference Live Web Version of URI
• 241k out of 313k URIs have a live web version
Step 4: Representative Memento vs. Live Version
• Apply content similarity measures
• Bin results into 6 clusters
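The binning step could look like the sketch below. The slides do not state the bin boundaries, so the six ranges here (a separate bin for a perfect score plus five equal-width ranges) are placeholders chosen purely for illustration, as are the function names.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: the real boundaries are not given in the slides.
BIN_EDGES = (20, 40, 60, 80, 100)
BIN_LABELS = ("[0, 20)", "[20, 40)", "[40, 60)", "[60, 80)", "[80, 100)", "100")


def bin_label(score: float) -> str:
    """Map a 0..100 similarity score to one of six bins."""
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]


def histogram(scores) -> dict:
    """Count how many scores fall into each bin."""
    return dict(Counter(bin_label(s) for s in scores))
```

Keeping a dedicated bin for a score of exactly 100 separates URIs whose live content is unchanged from those that have drifted, however slightly.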
[Figure: results by aggregate similarity score]
• Good: 23.7% of URIs have *not* drifted!
• Bad: 3/4 of URIs *have* drifted!
Content Drift & Link Rot Over Time - arXiv
[Figure panels: arXiv, Elsevier, PMC]
Take-Aways
1. Scholarly articles increasingly contain URI references to web at
large resources.
2. Such resources are subject to reference rot (link rot + content drift).
3. Custodians of these resources are typically not overly concerned
with archiving of their content and longevity of the scholarly record.
4. Spoiler: Authors, publishers, web archives, and other parties can
help tackle this problem (see my lightning talk + poster on Robust
Links).

Editor's Notes

  1. IceCube Neutrino Observatory at the University of Wisconsin http://icecube.wisc.edu
  2. Institute for Astronomy at the University of Hawaii http://www.ifa.hawaii.edu/~cowie/k_table.html
  3. Previously, archival status (14-day window) as proxy