Presentation given at the EMTACL12 conference in Trondheim, Norway, on October 1, 2012. Discusses the evolution towards a highly dynamic scholarly record (assets no longer have the sense of fixity they used to have; assets are highly interdependent) and how the archiving infrastructure used for scholarly communication cannot adequately deal with this dynamism.
2. Consideration 1 - A Dynamic Scholarly Record
• The scholarly record is expanding to include a wide range of
non-traditional assets emerging from eScience and eHumanities
endeavors.
• e.g. datasets, software, ontologies, workflows, online debate,
slides, blogs, videos, collaborative environments, etc.
• Many of these non-traditional assets:
• Do not have the sense of fixity that traditional assets such as
journal articles or books have.
• Have a wide range of dependencies on other assets.
• Even traditional assets are becoming increasingly dynamic and
dependent on other assets, which may themselves be dynamic.
Herbert Van de Sompel
Paint-Yourself-In-The-Corner Infrastructure
EMTACL 2012, Trondheim, Norway, October 1 2012
3. PeerJ Dynamic Content
http://peerj.com - http://www.publishersweekly.com/pw/by-topic/digital/content-and-e-books/article/52512-scholarly-publishing-2012-meet-peerj.html
4. Article Wikipedia Bridge
PLoS Computational Biology
http://blogs.plos.org/plos/2012/04/bridging-the-journal-wikipedia-gap/
5. Research Objects
Bechhofer, S. et al. (2010) http://precedings.nature.com/documents/4626/version/1
6. Executable Paper – Collage - Conceptual View
Nowakowski et al. (2011) The Collage Authoring Environment. Procedia Computer Science v4. http://dx.doi.org/10.1016/j.procs.2011.04.064
7. Executable Paper – Collage – Rendering a Paper
Nowakowski et al. (2011) The Collage Authoring Environment. Procedia Computer Science v4. http://dx.doi.org/10.1016/j.procs.2011.04.064
8. Scientific Workflows, Services, Data, Workflow Engines
Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
9. What is the Scholarly Record?
• It becomes challenging to define what the scholarly record is: where
does it start and where does it end?
• Transforming from a stack of journals or a bunch of PDF files
into a dynamic network of interconnected assets and actors.
“An article about computational science in a scientific publication is not the
scholarship itself, it is merely advertising of the scholarship. The actual
scholarship is the complete software development environment, [the
complete data] and the complete set of instructions which generated the
figures.” David Donoho, “Wavelab and Reproducible Research,” 1995
10. Fixity is Challenged …
• The ever-evolving nature of some assets challenges the notion of
fixity as “forever frozen” and calls for considering the notion of the
“state of the scholarly record at a specific moment in time”.
• Evolution from the version of record to a version of the
record.
• Whatever the boundaries of the scholarly record are, it will be
essential to be able to look back at certain assets in order to
understand how findings came about.
11. Consideration 2 – The Web as the Infrastructure
• For quite some time, the Web has been the conduit for scholarly
information. But the scholarly endeavor is increasingly embedded in,
and native to, the Web.
• From PDF to HTML.
• Social component: Contributors taking a central role.
• Machine component: Semantic, Linked Data technologies.
• The Web is becoming the infrastructure for the Scholarly Record.
• Long Term Sustainability: Reuse of infrastructure (network, software,
platforms, standards, etc.) that the entire world depends on.
• Integration of scholarly discourse with other Web-based discourse.
• The special requirements of Scholarly Communication (certification,
archiving, persistence, trust, annotation, metrics, …) must be addressed in
an interoperable manner within the Web infrastructure, not in some parallel
scholarly universe.
12. The Web as the Infrastructure: alt-metrics
http://altmetrics.org/manifesto/
13. http://impactstory.it/
14. The HTTP URI is the Identifier
• At the core of the Web are HTTP URIs.
• The Web-based scholarly record works because of HTTP URIs.
• Even when persistent identifiers are assigned to assets, contributors,
and institutions they need to be instantiated as HTTP URIs in order to do
anything useful with them on the Web.
• cf. http://dx.doi.org/…
• same for ORCID, I2, pmid, etc.
• Many non-traditional assets are born with an HTTP URI and never
obtain a persistent identifier.
• cf. presentations on SlideShare, software, ontologies, workflows,
etc.
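As a minimal sketch of what "instantiated as HTTP URIs" means for a DOI, the hypothetical helper below (`doi_to_http_uri` is an illustrative name, not part of any library) prefixes the `http://dx.doi.org/` resolver named on this slide. The DOI used is the one for the D-Lib Magazine article cited later in this presentation.

```python
from urllib.parse import quote

# Resolver prefix as shown on the slide.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_http_uri(doi: str) -> str:
    """Turn a bare DOI (optionally prefixed with 'doi:') into an HTTP URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode characters that are unsafe in a URI path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_http_uri("doi:10.1045/september2004-vandesompel"))
```

The same pattern would apply to other identifier schemes, each with its own resolver prefix.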
16. Existing Archival Infrastructure Assumes Fixity and Boundary
17. The Web Exists in the Perpetual Now
18. The Web Exists in the Perpetual Now
The lack of temporal capabilities of the Web has shaped our
expectations.
• We don’t object to prior versions not being available. We tolerate
404s.
• Reviewer of Memento paper at WWW 2010:
• Is there (sic) any statistics to show that many or a good number
of Web users should like to get obsolete data or resources
• Web archives are destinations, not integrated in the Web browsing
experience.
Nelson, M.L. (2012) http://arxiv.org/abs/1209.2664
19. Not Accessible From cnn.com
20. Paper Era: Publication Context
21. Paper Era: Publication Context
22. Web Era: Publication Context
23. Web Era: Publication Context
24. Several Challenges
• Archival approach and infrastructure to deal with dynamic,
interdependent content
• Referencing scholarly assets
• Recreating a version of the scholarly record
25. Recreating a Version of the Scholarly Record
• Is it possible to reconstruct the Web-based scholarly record as it was at
a certain point in time?
• For example, given a paper can one see the referenced/linked assets
as they were at the time of publication of the paper?
• The ability to reconstruct a version of the scholarly record will
become increasingly important as the scholarly endeavor and
discourse become increasingly dynamic and Web-based.
26. To Be Expected
27. Time-dependent decay of URLs published in MEDLINE abstracts
The most common types of dead links were for computer programs (43%), followed by
scholarly content (38%) and databases (19%).
Wren J D, Bioinformatics, 2008;24:1381-1385
28. Traces of the Past Web Exist
• Content Management Systems
• Web Archives
• Transactional archives
• Search engine caches
• …
29. If Only It Would Be Possible to Follow a URI in Time
30. It is with Memento
Digital Preservation Award 2010
http://www.mementoweb.org/
31. Time Travel
Select Date
Today Jun 16 1997
Jun 16 1997
From
Internet Archive
32. June 16 1997
http://www.ntnu.no/ @ June 16 1997
33. Original Resources and Mementos
34. Bridge from Present to Past
35. Bridge from Past to Present
36. Memento Framework
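The framework slide is a diagram in the original deck. As a hedged illustration of its core mechanism: in Memento's datetime negotiation (later standardized as RFC 7089, after this talk), a client asks a TimeGate for a past state of a resource by sending an `Accept-Datetime` request header holding an RFC 1123 date in GMT, and the selected Memento is returned with a `Memento-Datetime` response header. The sketch below only builds that request header and makes no network call; the function name `accept_datetime_headers` is an assumption made for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build the request header used for Memento datetime negotiation."""
    # Normalize to UTC, then format as an RFC 1123 date ending in "GMT".
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# The date used in the time-travel example earlier in the deck.
headers = accept_datetime_headers(datetime(1997, 6, 16, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

These headers would accompany an ordinary HTTP GET for the resource of interest, so the URI itself stays unchanged and only the requested time varies.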
37. Also with 404, etc.
38. Memento & IIPC
http://netpreserve.org/projects/memento
39. Memento & Wikipedia, Mediawiki
http://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Memento
40. Memento & DBpedia
http://mementoweb.org/depot/native/dbpedia/
41. To Be Expected
NOT IN ARCHIVE
42. Recreating a Version of the Scholarly Record
• Is it possible to reconstruct the Web-based scholarly record as it was at
a certain point in time?
• For example, given a paper can one see the referenced materials as
they were at the time of publication of the paper?
• Example:
Van de Sompel, H., Payette, S., Erickson, J., Lagoze, C., and Warner, S.
(2004) Rethinking scholarly communication: Building the System that
Scholars Deserve. D-Lib Magazine, 10(9). doi:10.1045/september2004-vandesompel ;
http://dx.doi.org/10.1045/september2004-vandesompel
43. Published
September 15 2004
45. Domain Gone
46. Archived copy
December 5 2003
48. Current version
49. Archived copy
December 11 2004
51. Resource gone
52. Archived copy
December 5 2003
54. Resource gone
55. Archived copy
unavailable
57. Current version
58. Archived copy
August 26 2003
59. Citation Rot Studies at Scale with Memento
• Pilot study:
• Papers from arXiv: 400,000 papers => 144,000 unique URIs
• Theses from the UNT ETD repository: 3,600 theses => 18,000 URIs
• URIs of established scholarly repositories removed (e.g. http://dx.doi.org),
i.e. focusing on the periphery of the scholarly record.
Sanderson, R., Phillips, M., and Van de Sompel, H. (2011) Analyzing the Persistence of Referenced Web
Resources with Memento. Open Repositories 2011; arXiv preprint arXiv:1105.3459 ; http://arxiv.org/abs/1105.3459
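The filtering step described in the pilot study can be sketched as follows. This is not the study's code: the function name `peripheral_uris` and the exclusion set are assumptions, with only `dx.doi.org` taken from the slide, and the sample input reuses URLs from this deck rather than the study's data.

```python
from urllib.parse import urlsplit

# Hosts treated as established scholarly repositories. Only dx.doi.org is
# named on the slide; any further entries would be assumptions.
EXCLUDED_HOSTS = {"dx.doi.org"}


def peripheral_uris(uris):
    """Return unique URIs whose host is not excluded, in first-seen order."""
    seen = set()
    kept = []
    for uri in uris:
        host = (urlsplit(uri).hostname or "").lower()
        if host in EXCLUDED_HOSTS or uri in seen:
            continue
        seen.add(uri)
        kept.append(uri)
    return kept


sample = [
    "http://dx.doi.org/10.1045/september2004-vandesompel",
    "http://www.mementoweb.org/",
    "http://www.mementoweb.org/",
    "http://altmetrics.org/manifesto/",
]
print(peripheral_uris(sample))
```

Each surviving URI could then be checked both on the live Web and, via Memento, in Web archives.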
60. UNT
61. arXiv
62. UNT
63. arXiv
65. DOI Redirects to R1
66. Later, DOI Redirects to R2, then R3
67. R1, R2, R3 Have Mementos
68. Looking for Memento of DOI with t in [t2,t3[
69. End Up at Wrong Memento
70. Introduce Temporal Awareness for DOI Resolver
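One way to picture a temporally aware resolver is a lookup over the DOI's redirection history. The sketch below is an illustration only, assuming such a history were recorded as (start date, target) pairs; the dates are invented, the labels R1, R2, R3 follow the preceding slides, and `target_at` is a hypothetical name.

```python
from bisect import bisect_right
from datetime import date


def target_at(history, when):
    """Return the redirect target in effect at `when`.

    `history` is a list of (start_date, target) pairs sorted by start_date;
    each target is in effect from its start date until the next entry begins.
    Returns None if `when` precedes the first entry.
    """
    starts = [start for start, _ in history]
    index = bisect_right(starts, when)
    return history[index - 1][1] if index else None


# Hypothetical history: t1, t2, t3 are made-up dates.
history = [
    (date(2005, 1, 1), "R1"),
    (date(2008, 1, 1), "R2"),
    (date(2011, 1, 1), "R3"),
]
print(target_at(history, date(2009, 6, 1)))
```

With such a lookup, a request for a time in [t2,t3[ would be routed to a Memento of R2 rather than to a Memento of whatever the DOI redirects to today.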
71. End Up at Correct Memento
72. But … the DOI Resolver Exists in the Perpetual Now
• The latest information indicates that the DOI redirection history is
currently not maintained
• The situation is aggravated by multiple consecutive redirects at
the publisher’s end (which are likely not archived because of strict
robots.txt rules)
• While HTTP DOIs help achieve long-term workable links, they
exist in the Perpetual Now like the rest of the Web’s URIs
73. Several Challenges
• Archival approach and infrastructure to deal with dynamic,
interdependent content
• Referencing scholarly assets
• Recreating a version of the scholarly record
74. Referencing Scholarly Assets
• With Memento, the same HTTP URI can function as the reference to
temporally evolving resources
• But in order to reference the appropriate temporal version, both the
HTTP URI and the desired time are needed.
• Essential for referencing resources in annotations
• A few possibilities:
• Express URI and time as is currently done in citations – human
readable, not machine actionable
• Turn the reference into a tuple: URI and machine-actionable
annotation of the URI – allows expressing fragments of
resources too
• Use DURI scheme
76. duri:1997-06-17:http://www.ntnu.no
http://www.ntnu.no/ @ June 16 1997
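A dated URI of the form shown on this slide bundles the two things a temporal reference needs: a time and an HTTP URI. As a minimal sketch, assuming only the `duri:YYYY-MM-DD:<uri>` shape of the slide's example (other timestamp forms are not handled), the hypothetical `parse_duri` helper below splits it into that pair.

```python
from datetime import date


def parse_duri(duri: str):
    """Split 'duri:YYYY-MM-DD:<uri>' into a (date, uri) tuple."""
    # Split on the first two colons only, so colons in the URI are preserved.
    scheme, stamp, uri = duri.split(":", 2)
    if scheme != "duri":
        raise ValueError(f"not a duri: {duri!r}")
    return date.fromisoformat(stamp), uri


when, uri = parse_duri("duri:1997-06-17:http://www.ntnu.no")
print(when.isoformat(), uri)
```

The resulting pair is what a Memento-aware client needs: the URI to request and the datetime to negotiate for.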
77. HTML5 Custom Protocol Handler
http://dev.opera.com/articles/view/html5-custom-protocol-and-content-handlers/
78. HTML5 Custom Protocol Handler
http://dev.opera.com/articles/view/html5-custom-protocol-and-content-handlers/
79. HTML5 Custom Protocol Handler
http://dev.opera.com/articles/view/html5-custom-protocol-and-content-handlers/
80. Referencing Scholarly Assets
81. Several Challenges
• Archival approach and infrastructure to deal with dynamic,
interdependent content
• Referencing scholarly assets
• Recreating a version of the scholarly record
82. Archival Approach
• Archiving via a combination of “curated”, “at point of interaction”,
and “in the wild” approaches:
o CMS, wikis, datawikis with solid versioning mechanisms can
play a significant role as archival hubs
o Archiving the linked context at the time of publication (cf.
WebCite), when submitted to an institutional repository, etc.
o Archiving at the moment of interaction with assets: reading,
commenting, annotating, liking, tweeting, executing, etc.
o Web archives come to the rescue for “in the wild” materials.
83. SiteStory Transactional Archiving
http://mementoweb.github.com/SiteStory/
84. SiteStory Transactional Archiving
http://mementoweb.github.com/SiteStory/
85. Conclusions
• Scholarly assets are increasingly dynamic and interdependent
• The existing scholarly archiving infrastructure is about fixity and
boundary
• Scholarly communication and, as a matter of fact, the entire
scholarly endeavor are increasingly Web-native
• The Web exists in the perpetual now
• This brings along significant challenges …