Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
A Clean Slate?
Herbert Van de Sompel
@hvdsomp
http://public.lanl.gov/herbertv/
Includes slides by Sean Bechhofer, Carole Goble, Robert Sanderson
Context of My Work, My Talk
• paper-based scholarly communication system
• scanned version of paper-based scholarly communication system
• natively digital, web-based, scholarly communication system
• painful transition
In Silico (Computational) Science
Datasets, Data collections, Algorithms, Configurations, Tools and Apps, Codes, Code Libraries, Services, Infrastructure, Compilers, Hardware
Simulations, data exploration, data processing, analytics, database based, text mining, auto recommendation, visual analytics…
Actually Digital Science is just Science
Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
Scientific Workflows, Services, Data, Workflow Engines
All components continuously in flux. How to reproduce results in such an environment?
Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
A Lot of Rs for Reproducibility
• Rerun: re-execute original experiment using revised setting.
• Review: validate and justify the results empirically. Trust. Understand. Train. Convincing and comfort.
• Replicate / Repeat: exactly replicate the original experiment. Eliminate change.
• Reproduce: run experiment with differences in elements (materials, methods, platform or setting) and compare to test for same result.
• Replay: run through what happened using logs without original platform or need to execute.
Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
A Lot of Rs for Reuse
• Refresh: execute an upgraded original experiment.
• Reconstruct: rebuild using new elements or different platform when they are lost/unavailable/inaccessible.
• Reuse: use as part of new experiments.
• Repurpose/Reassemble: reuse elements in a new experiment.
Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt
The Article is the Knowledge Bottleneck
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.”
Buckheit, J. and Donoho, D. (1995) WaveLab and Reproducible Research http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.2982
The Article is the Knowledge Bottleneck
“Changes are occurring in the ways in which scientific research is conducted. Within e-laboratories, methods such as scientific workflows, research protocols, standard operating procedures and algorithms for analysis and simulation are used to manipulate and produce data. Experimental or observational data and scientific models are typically born digital with no physical counterpart. This move to digital content is driving a sea-change in scientific publication, and challenging traditional scholarly publication.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
If not the Article, then What?
• Involved in each such experiment is a complex set of resources with complex relationships
• There is a need to share these resources in order to support forms of reuse, reproducibility
• This entails the augmentation of the scholarly record with an explicit account of the research process
• Digital exchange of each resource individually is trivial, exchange of the combined knowledge is not
• Traditional electronic publications cannot handle this job:
  • Targeted at humans, not machines
  • Communicates findings, not all scientific knowledge behind the findings
  • Content not decomposable in actionable units
  • Outputs, results, methods not reusable
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
The Clean Slate Challenge
Add features to support these needs to the existing scholarly communication system?
The Clean Slate Challenge
Start with a clean slate?
Research Objects
http://www.researchobject.org/ http://www.wf4ever-project.org/
Research Objects: Aggregated Content
• Data used or results produced in an experiment study
• Methods employed to produce and analyze that data
• Provenance and setting information about the experiments
• People involved in the investigation
• Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object.
http://www.researchobject.org/
Research Objects: Aggregation
“Research Objects are aggregations of content. Thus a Research Object framework needs to provide a mechanism for this aggregation. Aggregations are likely to include references to resources but there may also, however, be situations, where, for reasons of efficiency or in order to support persistence, Research Objects should also be able to aggregate literal data as well as references to data.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
• OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources
  • e.g. datasets, software, ontologies, workflows, online debate, slides, blogs, videos, etc.
• with various:
  • Relationships
  • Interdependencies
• How to convey this compound-ness in an interoperable manner so that applications can access, consume such assets?
2007 Funded by the Mellon Foundation & Microsoft Research
http://www.openarchives.org/ore/
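The compound-ness question can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an ORE serialization: the example.org URIs are hypothetical, and the triples are kept as plain tuples instead of going through an RDF library. Only the `describes` and `aggregates` terms are taken from the ORE vocabulary.

```python
# ORE terms namespace; the helper below and all example.org URIs are hypothetical.
ORE = "http://www.openarchives.org/ore/terms/"


def build_resource_map(rem_uri, aggregation_uri, aggregated_uris):
    """Return the triples of a minimal Resource Map as (subject, predicate, object)."""
    # The Resource Map describes the Aggregation ...
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    # ... and the Aggregation lists each Aggregated Resource.
    for uri in aggregated_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


triples = build_resource_map(
    "http://example.org/ro/1/rem",
    "http://example.org/ro/1",
    ["http://example.org/data.csv", "http://example.org/workflow.t2flow"],
)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

An application that understands these two predicates can discover every constituent resource from the Resource Map alone.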
Foundations of the ORE Solution
• Web Architecture - Resource, URI, Representation
• Semantic Web:
  • URIs for documents (information resources),
  • URIs for physical entities, concepts, abstractions (non-information resources)
  • RDF – to express properties, relationships pertaining to resources
• Linked Data:
  • HTTP URIs for both information and non-information resources
  • HTTP 303 redirect:
    • From: The HTTP URI of non-information resource
    • To: The HTTP URI of an information resource that describes the non-information resource
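The 303 pattern in the last bullet can be illustrated without touching the network. The sketch below assumes a hypothetical in-memory table standing in for an HTTP server; the URIs and the `dereference` helper are made up for illustration.

```python
# Hypothetical in-memory "server": URI -> (status code, Location header or body).
RESPONSES = {
    "http://example.org/id/aggregation-1": (303, "http://example.org/doc/aggregation-1.rdf"),
    "http://example.org/doc/aggregation-1.rdf": (200, "<rdf:RDF>...</rdf:RDF>"),
}


def dereference(uri, max_hops=5):
    """Follow 303 See Other redirects until a representation (200) is reached."""
    for _ in range(max_hops):
        status, payload = RESPONSES[uri]
        if status == 200:
            return uri, payload  # an information resource with a representation
        if status == 303:
            uri = payload  # the non-information resource points at its description
            continue
        raise ValueError(f"unexpected status {status} for {uri}")
    raise RuntimeError("too many redirects")


described_by, body = dereference("http://example.org/id/aggregation-1")
print(described_by)
```

The client never receives a representation from the non-information resource itself, only a pointer to a document that describes it.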
Adding Account of Research Life Cycles to Scholarly Record
Pepe, A., Mayernik, M., Borgman, C., Van de Sompel, H. (2009) Technology to Represent Scientific Practice: Data, Life Cycles, and Value Chains. http://dx.doi.org/10.1002/asi.21263
ORE & Research Objects
“…, Research Objects should also be able to aggregate literal data as well as references to data.”
• Aggregated Resources in ORE have HTTP URIs; probably needs to be relaxed.
• Embedding content in RDF, irrespective of ORE, is … interesting
  • See: Representing Content in RDF 1.0 http://www.w3.org/TR/Content-in-RDF10/
  • Allows embedding base64, text, XML
• Resource Map as manifest in e.g. ZIP file?
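The "manifest in a ZIP file" idea might look roughly like the following. This is a sketch under loud assumptions: the manifest is a small JSON file with a made-up `aggregates` key, whereas a real Resource Map would be serialized in RDF or Atom, and the file names are hypothetical.

```python
import io
import json
import zipfile


def pack_research_object(files, manifest_name="manifest.json"):
    """Bundle literal content plus a manifest listing it into ZIP bytes."""
    # Assumption: a JSON manifest stands in for a real ORE Resource Map.
    manifest = {"aggregates": sorted(files)}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(manifest_name, json.dumps(manifest, indent=2))
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def read_manifest(zip_bytes, manifest_name="manifest.json"):
    """Read the manifest back without extracting the aggregated files."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return json.loads(archive.read(manifest_name))


bundle = pack_research_object(
    {"data/results.csv": b"x,y\n1,2\n", "workflow.txt": b"step 1"}
)
manifest = read_manifest(bundle)
print(manifest["aggregates"])
```

Here the aggregated resources are identified by paths inside the archive, not by HTTP URIs, which is exactly the relaxation the slide suggests ORE would need.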
Research Objects: Annotation
“Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object.”
http://www.researchobject.org/
• Annotation is a pervasive scholarly activity, conducted by people and machines
• Many annotation efforts and tools
• But annotations stuck in silos:
  • Only consumable by client that created it
  • Annotations not shareable beyond original environment
• Open Annotation focuses on interoperability for annotations in order to allow sharing of annotations across:
  • Annotation clients
  • Content collections
  • Services that leverage annotations
2009 Funded by the Mellon Foundation
http://www.openannotation.org/spec/core/
W3C Open Annotation Community Group
• Established to reconcile Open Annotation Collaboration and Annotation Ontology models
• 67 participants from around the world: 7th of 119 groups
  • Many universities, also commercial and not-for-profit
• Mission: Interoperability between Annotation systems and platforms, by
  … following the Architecture of the Web
  … reusing existing web standards
  … providing a single, coherent model to implement
  … without requiring adoption of specific platforms
  … while maintaining low implementation costs
http://www.w3.org/community/openannotation/
What is an Annotation?
“An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is related to the target.”
Highlighting, Bookmarking; Commenting, Describing; Tagging, Linking; Classifying, Identifying; Questioning, Replying; Editing, Moderating
… Provide an Aide-Memoire
… Share and Inform
… Improve Discovery
… Organize Resources
… Interact with Others
… Create as well as Consume
http://www.w3.org/community/openannotation/
Further Specification of Resources
Specific Body and Specific Target resources identify the region of interest, and/or the state of the resource.
Need to be able to describe the state of the resource, the segment of interest, and potentially styling hints for how to render it.
Open Annotation introduces:
• State: describes how to retrieve representation
• Selector: describes how to select segment
• Style: describes how to render/process segment
• Scope: describes context of the resource
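A body, a target, and a Selector on a Specific Target can be put together as a nested structure. The sketch below is illustrative only: the dictionary layout is an assumption, not a normative serialization, the URIs are hypothetical, and only the `oa:` class and property names are meant to follow the Open Annotation vocabulary.

```python
import json


def make_annotation(body_uri, source_uri, exact_text):
    """Build an annotation whose target is a text segment of the source resource."""
    return {
        "@type": "oa:Annotation",
        "oa:hasBody": body_uri,
        "oa:hasTarget": {
            # A Specific Resource narrows the full source down to a segment.
            "@type": "oa:SpecificResource",
            "oa:hasSource": source_uri,
            "oa:hasSelector": {
                "@type": "oa:TextQuoteSelector",
                "oa:exact": exact_text,
            },
        },
    }


# Hypothetical URIs, for illustration only.
annotation = make_annotation(
    "http://example.org/comments/42",
    "http://example.org/papers/research-objects.html",
    "aggregations of content",
)
print(json.dumps(annotation, indent=2))
```

A State, Style, or Scope would attach to the same Specific Resource node in a similar way.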
W3C Open Annotation & Research Objects
• Early renderings of Research Objects emerging from the Wf4Ever project use Annotation Ontology as the annotation framework
• But since the Annotation Ontology and Open Annotation Collaboration models now merge into the W3C Open Annotation model, it is safe to assume W3C Open Annotation will be used for Research Objects
Research Objects: Versioning and Evolution
“Research Objects are dynamic in that their contents can change and be changed – additional contents may be added to aggregations, or additional metadata can be asserted about the contents or relationships between content. The resources that are aggregated may change. Thus there is a need for versioning, allowing the recording of changes to objects, potentially along with facilities for retrieving objects or aggregated elements at particular historical points in their lifecycle.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
ORE Experiment: Versioning and Evolution of Compound Objects
Van de Sompel, H. et al. (2007) Appendix to Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication
http://www.ctwatch.org/quarterly/articles/2007/08/interoperability-for-the-discovery-use-and-re-use-of-units-of-scholarly-communication/
• Memento is about the Web and time:
  • Resources evolve over time
  • Only the current representation is available from a resource’s URI
  • How to seamlessly access prior representations, if they exist?
• Memento looks at this problem for the Web, in general
Digital Preservation Award 2010
2009 Funded by the Library of Congress
http://www.mementoweb.org/
URI for Original, URI for Version
Web Archive
URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
URI-R - http://www.cnn.com/
URI for Original, URI for Version
CMS
URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
URI-R - http://en.wikipedia.org/wiki/September_11_attacks
Time Travel for the Web: Demo http://www.mementoweb.org/demo/Memento_Time_Travel.mov
Memento & Research Objects
• The combination of:
  • Pro-active archiving of Research Objects and their constituent resources, using
    • Web archiving techniques, e.g. crawling, transactional archiving
    • Platforms with strong versioning capabilities, e.g. data wikis, GitHub
  • Assigning URIs to Research Objects and their constituent resources according to the well-established time-generic (URI-R) and time-specific (URI-M) resource pattern
  • The Memento protocol to access time-specific versions of Research Objects and their constituent resources via their time-generic URI and timestamp
makes a good candidate for addressing the versioning and evolution need.
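Accessing a time-specific version via a time-generic URI and a timestamp can be sketched as follows. The sketch makes no network calls and rests on assumptions: the version list is hypothetical (the second URI-M is invented for illustration), and `closest_memento` only mimics the selection a server would perform when it receives an `Accept-Datetime` request header.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical version list for one URI-R: (archival datetime, URI-M) pairs.
MEMENTOS = [
    (datetime(2001, 9, 11, 20, 36, 10, tzinfo=timezone.utc),
     "http://web.archive.org/web/20010911203610/http://www.cnn.com/"),
    (datetime(2005, 1, 1, tzinfo=timezone.utc),
     "http://web.archive.org/web/20050101000000/http://www.cnn.com/"),
]


def accept_datetime_header(when):
    """Express the desired datetime as an HTTP-date for the Accept-Datetime header."""
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}


def closest_memento(when, mementos=MEMENTOS):
    """Pick the URI-M whose archival datetime is nearest to the requested one."""
    return min(mementos, key=lambda pair: abs(pair[0] - when))[1]


wanted = datetime(2001, 9, 12, tzinfo=timezone.utc)
headers = accept_datetime_header(wanted)
uri_m = closest_memento(wanted)
print(headers, uri_m)
```

The client only needs the URI-R and a datetime; which URI-M comes back is decided on the server side.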
Research Objects: Provenance
“The issue of provenance, and being able to audit experiments and investigations is key to the scientific method. Third parties must be able to audit the steps performed in an experiment in order to be convinced of the validity of results. Audit is required not just for regulatory purposes, but allows for the results of experiments to be interpreted and reused, thus a Research Object should provide sufficient information to support audit of the aggregation as a whole, its constituent parts, and any process that it may encapsulate.”
Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge http://dx.doi.org/10.1038/npre.2010.4626.1
Provenance
Van de Sompel, H. (2003) Roadblocks http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html
Open Provenance Model
Moreau, L. et al. (2010) The Open Provenance Model: Abstract Model http://eprints.ecs.soton.ac.uk/21449/
• ResourceSync is about synchronization of web resources, things with a URI that can be dereferenced
• Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
• Low change frequency (weeks/months) to high change frequency (seconds)
• Synchronization latency and accuracy needs may vary
• Modular framework based on Sitemaps and extensions
2012 Funded by the Sloan Foundation
http://www.openarchives.org/rs/
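The Sitemap basis of the framework can be sketched with the standard library. This minimal example emits only plain Sitemap elements (`urlset`, `url`, `loc`, `lastmod`); the ResourceSync extension elements are deliberately left out, and the helper names and example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_resource_list(resources):
    """Serialize (URI, last-modified) pairs as a plain Sitemap document."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def changed_since(resources, cutoff):
    """Return URIs modified after the cutoff.

    Assumes all timestamps are UTC strings in one fixed ISO 8601 format,
    so that plain string comparison matches chronological order.
    """
    return [uri for uri, lastmod in resources if lastmod > cutoff]


resources = [
    ("http://example.org/res1", "2013-01-02T13:00:00Z"),
    ("http://example.org/res2", "2013-05-29T09:00:00Z"),
]
xml_text = build_resource_list(resources)
to_sync = changed_since(resources, "2013-03-01T00:00:00Z")
print(xml_text)
```

A destination that remembers when it last synchronized only needs to fetch the URIs returned by `changed_since`.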
Hiberlink
• Investigates reference rot at massive scale:
  • Citation rot - Do HTTP references in scholarly articles still resolve?
  • Content rot - If so, is the content at the end of the HTTP reference still representative of the content that was originally referenced?
• Investigates pro-active ways to archive HTTP referenced resources that occur in scholarly articles
2013 Funded by the Mellon Foundation
Soon at http://www.hiberlink.org
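The two kinds of rot can be told apart with a simple check. The sketch below is an illustration, not the project's method: treating any non-200 status as citation rot, measuring representativeness with `difflib` text similarity, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def classify_reference(status_code, original_text, current_text, threshold=0.8):
    """Label a reference as 'citation rot', 'content rot', or 'intact'."""
    # Citation rot: the HTTP reference no longer resolves.
    if status_code != 200:
        return "citation rot"
    # Content rot: it resolves, but the content has drifted from what was cited.
    similarity = SequenceMatcher(None, original_text, current_text).ratio()
    return "content rot" if similarity < threshold else "intact"


print(classify_reference(404, "cited text", "cited text"))
print(classify_reference(200, "cited text", "cited text"))
print(classify_reference(200, "abcdefghij", "zzzzzzzzzz"))
```

Detecting content rot requires a copy of what was originally referenced, which is where pro-active archiving comes in.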
Research Objects
http://www.researchobject.org/ http://www.wf4ever-project.org/