Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
2. paper-based scholarly communication system
scanned version of paper-based scholarly communication system
natively digital, web-based, scholarly communication system
Context of My Work, My Talk
painful transition
3. In Silico (Computational) Science
Datasets
Data collections
Algorithms
Configurations
Tools and Apps
Codes
Code Libraries
Services,
Infrastructure,
Compilers
Hardware
Simulations, data exploration, data processing, analytics, database based, text
mining, auto recommendation, visual analytics…
Actually Digital Science is just Science
Carole Goble, JCDL 2012 Keynote
https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
4. Scientific Workflows, Services, Data, Workflow Engines
Carole Goble, JCDL 2012 Keynote
https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
All components continuously in flux.
How to reproduce results in such an environment?
5. A Lot of Rs for Reproducibility
• Rerun re-execute original experiment using revised setting.
• Review Validate and justify the results empirically. Trust.
Understand. Train. Convincing and comfort
• Replicate / Repeat Exactly replicate the original experiment.
Eliminate change.
• Reproduce Run experiment with differences in elements (materials,
methods, platform or setting) and compare to test for same result.
• Replay Run through what happened using logs without original
platform or need to execute.
Carole Goble, JCDL 2012 Keynote
https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
6. A Lot of Rs for Reuse
• Refresh execute an upgraded original experiment.
• Reconstruct rebuild using new elements or different platform when
they are lost/unavailable/inaccessible
• Reuse use as part of new experiments.
• Repurpose/Reassemble reuse elements in a new experiment
Carole Goble, JCDL 2012 Keynote
https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
7. The Article is the Knowledge Bottleneck
“An article about computational science in a scientific
publication is not the scholarship itself, it is merely
advertising of the scholarship. The actual scholarship is the
complete software development environment, [the complete
data] and the complete set of instructions which generated
the figures.”
Buckheit, J. and Donoho, D. (1995) WaveLab and Reproducible Research
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.2982
8. The Article is the Knowledge Bottleneck
“Changes are occurring in the ways in which scientific
research is conducted. Within e-laboratories, methods such
as scientific workflows, research protocols, standard
operating procedures and algorithms for analysis and
simulation are used to manipulate and produce data.
Experimental or observational data and scientific models are
typically born digital with no physical counterpart. This move
to digital content is driving a sea-change in scientific
publication, and challenging traditional scholarly
publication.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital
Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
9. • Involved in each such experiment is a complex set of resources
with complex relationships
• There is a need to share these resources in order to support
forms of reuse, reproducibility
• This entails the augmentation of the scholarly record with
an explicit account of the research process
• Digital exchange of each resource individually is trivial,
exchange of the combined knowledge is not
• Traditional electronic publications cannot handle this job
• Targeted at humans, not machines
• Communicates findings not all scientific knowledge behind
the findings
• Content not decomposable in actionable units
• Outputs, results, methods not reusable
If not the Article, then What?
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital
Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
14. Research Objects: Aggregated Content
• Data used or results produced in
an experiment study
• Methods employed to produce and
analyze that data
• Provenance and setting
information about the experiments
• People involved in the
investigation
• Annotations about these
resources, that are essential to the
understanding and interpretation of
the scientific outcomes captured
by a research object.
http://www.researchobject.org/
17. Research Objects: Aggregation
“Research Objects are aggregations of content. Thus a
Research Object framework needs to provide a mechanism
for this aggregation. Aggregations are likely to include
references to resources but there may also, however, be
situations, where, for reasons of efficiency or in order to
support persistence, Research Objects should also be able
to aggregate literal data as well as references to data.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital
Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
18. • OAI-ORE observation: Scholarly assets are
rapidly becoming compound, consisting of
multiple resources
• e.g. datasets, software, ontologies,
workflows, online debate, slides, blogs,
videos, etc.
with various:
• Relationships
• Interdependencies
• How to convey this compound-ness in an
interoperable manner so that applications
can access, consume such assets?
2007
Funded by the Mellon Foundation & Microsoft Research
http://www.openarchives.org/ore/
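A minimal sketch of how compound-ness could be conveyed with the ORE vocabulary: a Resource Map describes an Aggregation, which in turn aggregates resources. The vocabulary terms come from the ORE namespace. All example.org URIs and the helper names are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, aggregated_uris):
    """Return (subject, predicate, object) triples for a minimal Resource Map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    for uri in aggregated_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical URIs, for illustration only.
triples = resource_map(
    "http://example.org/ro/1/rem",
    "http://example.org/ro/1",
    ["http://example.org/data.csv", "http://example.org/workflow.t2flow"],
)
print(to_ntriples(triples))
```

The sketch handles URI objects only. Literal values, relationships between the aggregated resources, and metadata about the Resource Map itself are left out.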
21. Foundations of the ORE Solution
• Web Architecture - Resource, URI, Representation
• Semantic Web:
• URIs for documents (information resources),
• URIs for physical entities, concepts, abstractions (non-information
resources)
• RDF – to express properties, relationships pertaining to resources
• Linked Data:
• HTTP URIs for both information and non-information resources
• HTTP 303 redirect:
• From: The HTTP URI of non-information resource
• To: The HTTP URI of an information resource that describes
the non-information resource
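The 303 pattern can be sketched without a network: a toy server answers requests for a non-information resource with a redirect to the document that describes it. The URIs, the two lookup tables and the function names are assumptions made for illustration.

```python
# Hypothetical data: one non-information resource and its describing document.
DESCRIBED_BY = {
    "http://example.org/id/experiment-42": "http://example.org/doc/experiment-42",
}
DOCUMENTS = {
    "http://example.org/doc/experiment-42": "RDF description of experiment 42",
}


def respond(uri):
    """Toy server: return (status, headers, body) for a GET on uri."""
    if uri in DESCRIBED_BY:
        # Non-information resource: point at the describing document.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    if uri in DOCUMENTS:
        # Information resource: return a representation.
        return 200, {}, DOCUMENTS[uri]
    return 404, {}, ""


def dereference(uri, max_hops=5):
    """Toy client: follow 303 redirects and return (status, final_uri, body)."""
    for _ in range(max_hops):
        status, headers, body = respond(uri)
        if status == 303:
            uri = headers["Location"]
            continue
        return status, uri, body
    raise RuntimeError("too many redirects")


result = dereference("http://example.org/id/experiment-42")
```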
30. Adding Account of Research Life Cycles to Scholarly Record
Pepe, A., Mayernik, M., Borgman, C., Van de Sompel, H. (2009) Technology to
Represent Scientific Practice: Data, Life Cycles, and Value Chains.
http://dx.doi.org/10.1002/asi.21263
31. ORE & Research Objects
“…, Research Objects should also be able to aggregate literal data as
well as references to data.”
• Aggregated Resources in ORE have HTTP URIs; probably needs to
be relaxed.
• Embedding content in RDF, irrespective of ORE, is … interesting
• See: Representing Content in RDF 1.0
http://www.w3.org/TR/Content-in-RDF10/
• Allows embedding base64, text, XML
• Resource Map as manifest in e.g. ZIP file?
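A rough sketch of the manifest-in-a-ZIP idea: the package carries literal data as files, and a manifest lists both those files and the resources aggregated by reference. The manifest here is plain JSON with made-up keys, a stand-in for an actual Resource Map serialization. File names and URIs are hypothetical.

```python
import io
import json
import zipfile


def build_bundle(references, literals):
    """Pack a manifest plus literal data into an in-memory ZIP.

    references: list of URIs aggregated by reference.
    literals:   dict mapping a path inside the ZIP to its bytes.
    """
    manifest = {
        "aggregates": [{"uri": uri} for uri in references]
        + [{"path": path} for path in sorted(literals)]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, data in literals.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def read_manifest(bundle_bytes):
    """Read the manifest back out of a bundle."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
        return json.loads(archive.read("manifest.json"))


bundle = build_bundle(
    ["http://example.org/workflow.t2flow"],
    {"data/readings.csv": b"t,v\n0,1\n"},
)
manifest = read_manifest(bundle)
```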
33. Research Objects: Annotation
“Annotations about these resources, that are essential to the
understanding and interpretation of the scientific outcomes
captured by a research object.”
http://www.researchobject.org/
34. • Annotation is a pervasive scholarly activity,
conducted by people and machines
• Many annotation efforts and tools
• But annotations stuck in silos:
• Only consumable by the client that created
them
• Annotations not shareable beyond
original environment
• Open Annotation focuses on interoperability
for annotations in order to allow sharing of
annotations across:
• Annotation clients
• Content collections
• Services that leverage annotations
2009
Funded by the Mellon Foundation
http://www.openannotation.org/spec/core/
35. • Established to reconcile Open Annotation Collaboration and
Annotation Ontology models
• 67 participants from around the world: 7th of 119 groups
Many universities, also commercial and not-for-profit
• Mission:
Interoperability between Annotation systems and platforms, by
…following the Architecture of the Web
…reusing existing web standards
…providing a single, coherent model to implement
…without requiring adoption of specific platforms
…while maintaining low implementation costs
W3C Open Annotation Community Group
http://www.w3.org/community/openannotation/
36. “An Annotation is considered to be a set of connected
resources, typically including a body and target, where
the body is related to the target.”
Highlighting, Bookmarking
Commenting, Describing
Tagging, Linking
Classifying, Identifying
Questioning, Replying
Editing, Moderating
…Provide an Aide-Memoire
…Share and Inform
…Improve Discovery
…Organize Resources
…Interact with Others
…Create as well as Consume
What is an Annotation?
http://www.w3.org/community/openannotation/
44. Specific Body and Specific Target resources identify the region of
interest, and/or the state of the resource.
Need to be able to describe the state of the resource, the segment
of interest, and potentially styling hints for how to render it.
Open Annotation introduces:
State Describes how to retrieve representation
Selector Describes how to select segment
Style Describes how to render/process segment
Scope Describes context of the resource
Further Specification of Resources
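A minimal sketch of the body/target model with an optional Specific Resource, written as JSON-LD-style dictionaries. The property and class names are taken from the Open Annotation namespace. The helper function and the example.org URIs are hypothetical, and State, Style and Scope are not shown.

```python
import json

OA = "http://www.w3.org/ns/oa#"


def annotate(body_uri, source_uri, exact=None):
    """Build an annotation; if `exact` is given, target a quoted text segment."""
    if exact is None:
        # Plain target: the whole resource.
        target = {"@id": source_uri}
    else:
        # Specific Resource: the source plus a selector for the segment.
        target = {
            "@type": OA + "SpecificResource",
            OA + "hasSource": {"@id": source_uri},
            OA + "hasSelector": {
                "@type": OA + "TextQuoteSelector",
                OA + "exact": exact,
            },
        }
    return {
        "@type": OA + "Annotation",
        OA + "hasBody": {"@id": body_uri},
        OA + "hasTarget": target,
    }


whole = annotate("http://example.org/comment/1", "http://example.org/paper.html")
segment = annotate(
    "http://example.org/comment/2",
    "http://example.org/paper.html",
    exact="born digital",
)
print(json.dumps(segment, indent=2))
```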
47. W3C Open Annotation & Research Objects
• Early renderings of Research Objects emerging from the Wf4Ever
project use Annotation Ontology as the annotation framework
• But since the Annotation Ontology and Open Annotation Collaboration
models now merge into the W3C Open Annotation model, it is safe to
assume W3C Open Annotation will be used for Research Objects
49. Research Objects: Versioning and Evolution
“Research Objects are dynamic in that their contents can
change and be changed – additional contents may be
added to aggregations, or additional metadata can be
asserted about the contents or relationships between
content. The resources that are aggregated may change.
Thus there is a need for versioning, allowing the recording
of changes to objects, potentially along with facilities for
retrieving objects or aggregated elements at particular
historical points in their lifecycle.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital
Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
50. ORE Experiment: Versioning and Evolution of Compound Objects
Van de Sompel, H. et al. (2007) Appendix to Interoperability for the Discovery, Use, and
Re-Use of Units of Scholarly Communication
http://www.ctwatch.org/quarterly/articles/2007/08/interoperability-for-the-discovery-use-and-re-use-of-units-of-scholarly-communication/
51. • Memento is about the Web and time:
• Resources evolve over time
• Only the current representation is
available from a resource’s URI
• How to seamlessly access prior
representations, if they exist?
• Memento looks at this problem for the Web,
in general
Digital Preservation Award 2010
2009
Funded by the Library of Congress
http://www.mementoweb.org/
52. URI for Original, URI for Version
URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
Web Archive
URI-R - http://www.cnn.com/
53. URI for Original, URI for Version
URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
CMS
URI-R - http://en.wikipedia.org/wiki/September_11_attacks
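In the web archive example, the URI-M embeds both a 14-digit capture timestamp and the URI-R. A small sketch that splits such a URI-M back into its parts. The function name is hypothetical, and the pattern only covers the archive URL shape shown on the slide. It does not apply to the CMS example, where the version is identified by an oldid.

```python
import re
from datetime import datetime, timezone

# Pattern for archive URLs of the form .../web/<YYYYMMDDhhmmss>/<URI-R>
WAYBACK = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Return (uri_r, capture_datetime) for a web-archive URI-M."""
    match = WAYBACK.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognised URI-M: {uri_m}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return match.group(2), captured.replace(tzinfo=timezone.utc)


uri_r, captured = split_uri_m(
    "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
)
print(uri_r, captured.isoformat())
```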
60. Time Travel for the Web: Demo
http://www.mementoweb.org/demo/Memento_Time_Travel.mov
65. Memento & Research Objects
• The combination of:
• Pro-active archiving of Research Objects and their constituent
resources, using
• Web archiving techniques, e.g. crawling, transactional
archiving
• Platforms with strong versioning capabilities, e.g. datawikis,
github
• Assigning URIs to Research Objects and their constituent
resources according to the well-established time-generic (URI-R)
and time-specific (URI-M) resource pattern
• The Memento protocol to access time-specific versions of
Research Objects and their constituent resources via their
time-generic URI and timestamp
makes a good candidate for addressing the versioning and evolution
need.
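A sketch of access by time-generic URI plus timestamp: the client expresses the desired time in an Accept-Datetime header, and something on the server side picks the version closest to it. The version list, URIs and function names are hypothetical, and nearest-in-time selection is one plausible policy, not the only one.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(moment):
    """Format a timezone-aware datetime as an HTTP date for Accept-Datetime."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def closest_memento(mementos, wanted):
    """Pick the (uri_m, datetime) pair nearest in time to `wanted`."""
    return min(mementos, key=lambda memento: abs(memento[1] - wanted))


# Hypothetical versions of one Research Object (URI-R: http://example.org/ro/1).
versions = [
    ("http://example.org/ro/1/v1", datetime(2012, 1, 10, tzinfo=timezone.utc)),
    ("http://example.org/ro/1/v2", datetime(2012, 6, 1, tzinfo=timezone.utc)),
    ("http://example.org/ro/1/v3", datetime(2013, 3, 15, tzinfo=timezone.utc)),
]

wanted = datetime(2012, 7, 1, tzinfo=timezone.utc)
header = accept_datetime(datetime(2013, 5, 29, 12, 0, 0, tzinfo=timezone.utc))
chosen = closest_memento(versions, wanted)
```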
67. Research Objects: Provenance
“The issue of provenance, and being able to audit
experiments and investigations is key to the scientific
method. Third parties must be able to audit the steps
performed in an experiment in order to be convinced of the
validity of results. Audit is required not just for regulatory
purposes, but allows for the results of experiments to be
interpreted and reused, thus a Research Object should
provide sufficient information to support audit of the
aggregation as a whole, its constituent parts, and any
process that it may encapsulate.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital
Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
68. Van de Sompel, H. (2003) Roadblocks http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html
Provenance
69. Moreau, L. et al. (2010) The Open Provenance Model: Abstract Model
http://eprints.ecs.soton.ac.uk/21449/
Open Provenance Model
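A toy illustration of auditing with a provenance graph. Edges point from effect to cause, using the Open Provenance Model edge names "used" and "wasGeneratedBy". The artifact and process names are invented, and the traversal is an ordinary graph walk, not part of the model itself.

```python
# Each edge is (effect, relation, cause). All node names are hypothetical.
EDGES = [
    ("figure.png", "wasGeneratedBy", "plot"),
    ("plot", "used", "results.csv"),
    ("results.csv", "wasGeneratedBy", "simulate"),
    ("simulate", "used", "config.yaml"),
    ("simulate", "used", "input.dat"),
]


def lineage(node, edges):
    """Return every artifact and process that `node` causally depends on."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        for effect, _relation, cause in edges:
            if effect == current and cause not in seen:
                seen.append(cause)
                stack.append(cause)
    return seen


audit_trail = lineage("figure.png", EDGES)
print(audit_trail)
```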
73. • ResourceSync is about synchronization of
web resources, things with a URI that can
be dereferenced
• Small websites/repositories (a few
resources) to large repositories/datasets/
linked data collections (many millions of
resources)
• Low change frequency (weeks/months) to
high change frequency (seconds)
• Synchronization latency and accuracy
needs may vary
• Modular framework based on Sitemaps and
extensions
2012
Funded by the Sloan Foundation
http://www.openarchives.org/rs/
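A sketch of the Sitemap-based idea: a source publishes its resources as a Sitemap, and a destination compares two such lists to decide what to fetch or drop. Only the base Sitemap elements are used here, without the ResourceSync extensions. The function names and example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"


def resource_list(resources):
    """Serialize {uri: lastmod} as a Sitemap document."""
    urlset = ET.Element(f"{{{SM}}}urlset")
    for uri, lastmod in sorted(resources.items()):
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = uri
        ET.SubElement(url, f"{{{SM}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def parse(xml_text):
    """Read a Sitemap document back into {uri: lastmod}."""
    root = ET.fromstring(xml_text)
    return {
        url.find(f"{{{SM}}}loc").text: url.find(f"{{{SM}}}lastmod").text
        for url in root.findall(f"{{{SM}}}url")
    }


def diff(old, new):
    """Return (created, updated, deleted) URIs between two snapshots."""
    created = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    updated = sorted(u for u in set(old) & set(new) if old[u] != new[u])
    return created, updated, deleted


before = {"http://example.org/a": "2013-05-01", "http://example.org/b": "2013-05-01"}
after = {"http://example.org/b": "2013-05-20", "http://example.org/c": "2013-05-21"}
changes = diff(parse(resource_list(before)), parse(resource_list(after)))
```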
74. • Investigates reference rot at massive scale:
• Citation rot - Do HTTP references in
scholarly articles still resolve?
• Content rot - If so, is the content at the
end of the HTTP reference still
representative of the content that was
originally referenced?
• Investigates pro-active ways to archive
HTTP referenced resources that occur in
scholarly articles
2013
hiberlink
Funded by the Mellon Foundation
Soon at http://www.hiberlink.org
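The two kinds of rot can be told apart with a simple check, sketched below. The function, its inputs and the sample data are hypothetical. Comparing content hashes is a crude stand-in for judging whether content is "still representative", which in practice needs a more tolerant similarity measure.

```python
import hashlib


def classify(status, referenced_body, current_body):
    """Classify one HTTP reference.

    status:          HTTP status of the reference today, or None if unreachable.
    referenced_body: bytes of the content as originally referenced.
    current_body:    bytes served at the reference today.
    """
    if status is None or status >= 400:
        return "citation rot"  # the reference no longer resolves
    then = hashlib.sha256(referenced_body).hexdigest()
    now = hashlib.sha256(current_body).hexdigest()
    if then != now:
        return "content rot"  # resolves, but the content has changed
    return "intact"


# Hypothetical observations for three references.
observations = [
    (404, b"dataset v1", b""),
    (200, b"dataset v1", b"parked domain"),
    (200, b"dataset v1", b"dataset v1"),
]
labels = [classify(*observation) for observation in observations]
```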