A Clean Slate?


Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.

  1. 1. A Clean Slate?
       @hvdsomp van de sompel
       Includes slides by Sean Bechhofer, Carole Goble, Robert Sanderson
  2. 2. Context of My Work, My Talk
       • paper-based scholarly communication system
       • scanned version of paper-based scholarly communication system
       • natively digital, web-based, scholarly communication system
       • painful transition
  3. 3. In Silico (Computational) Science
       • Datasets, Data collections, Algorithms, Configurations, Tools and Apps, Codes, Code Libraries, Services, Infrastructure, Compilers, Hardware
       • Simulations, data exploration, data processing, analytics, database based, textmining, auto recommendation, visual analytics…
       • Actually Digital Science is just Science
       (Carole Goble, JCDL 2012 Keynote)
  4. 4. Scientific Workflows, Services, Data, Workflow Engines
       Components continuously in flux. How to reproduce results in such an environment?
       (Carole Goble, JCDL 2012 Keynote)
  5. 5. A Lot of Rs for Reproducibility
       • Rerun: re-execute original experiment using revised setting.
       • Review: Validate and justify the results empirically. Trust. Understand. Train. Convincing and comfort.
       • Replicate / Repeat: Exactly replicate the original experiment. Eliminate change.
       • Reproduce: Run experiment with differences in elements (materials, methods, platform or setting) and compare to test for same result.
       • Replay: Run through what happened using logs without original platform or need to execute.
       (Carole Goble, JCDL 2012 Keynote)
  6. 6. A Lot of Rs for Reuse
       • Refresh: execute an upgraded original experiment.
       • Reconstruct: rebuild using new elements or different platform when they are lost/unavailable/inaccessible.
       • Reuse: use as part of new experiments.
       • Repurpose/Reassemble: reuse elements in a new experiment.
       (Carole Goble, JCDL 2012 Keynote)
  7. 7. The Article is the Knowledge Bottleneck
       “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.”
       Buckheit, J. and Donoho, D. (1995) Wavelab and reproducible research
  8. 8. The Article is the Knowledge Bottleneck
       “Changes are occurring in the ways in which scientific research is conducted. Within e-laboratories, methods such as scientific workflows, research protocols, standard operating procedures and algorithms for analysis and simulation are used to manipulate and produce data. Experimental or observational data and scientific models are typically born digital with no physical counterpart. This move to digital content is driving a sea-change in scientific publication, and challenging traditional scholarly publication.”
       Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge
  9. 9. If not the Article, then What?
       • Involved in each such experiment is a complex set of resources with complex relationships
       • There is a need to share these resources in order to support forms of reuse, reproducibility
       • This entails the augmentation of the scholarly record with an explicit account of the research process
       • Digital exchange of each resource individually is trivial, exchange of the combined knowledge is not
       • Traditional electronic publications cannot handle this job:
           • Targeted at humans, not machines
           • Communicate findings, not all scientific knowledge behind the findings
           • Content not decomposable in actionable units
           • Outputs, results, methods not reusable
       Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge
  10. 10. The Clean Slate Challenge
  11. 11. The Clean Slate Challenge: Add features to support these needs to the existing scholarly communication system?
  12. 12. The Clean Slate Challenge: Start with a clean slate?
  13. 13. Research Objects
  14. 14. Research Objects: Aggregated Content
       • Data used or results produced in an experiment study
       • Methods employed to produce and analyze that data
       • Provenance and setting information about the experiments
       • People involved in the investigation
       • Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object.
  15. 15.
  16. 16. Research Objects
  17. 17. Research Objects: Aggregation
       “Research Objects are aggregations of content. Thus a Research Object framework needs to provide a mechanism for this aggregation. Aggregations are likely to include references to resources but there may also, however, be situations, where, for reasons of efficiency or in order to support persistence, Research Objects should also be able to aggregate literal data as well as references to data.”
       Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge
  18. 18.
       • OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources
           • e.g. datasets, software, ontologies, workflows, online debate, slides, blogs, videos, etc.
         with various:
           • Relationships
           • Interdependencies
       • How to convey this compound-ness in an interoperable manner so that applications can access, consume such assets?
       (2007; funded by the Mellon Foundation & Microsoft Research)
  19. 19. Foundations of the ORE Solution
       • Web Architecture - Resource, URI, Representation
       • Semantic Web:
           • URIs for documents (information resources)
           • URIs for physical entities, concepts, abstractions (non-information resources)
           • RDF – to express properties, relationships pertaining to resources
       • Linked Data:
           • HTTP URIs for both information and non-information resources
           • HTTP 303 redirect:
               • From: The HTTP URI of a non-information resource
               • To: The HTTP URI of an information resource that describes the non-information resource
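The 303 pattern named on this slide can be illustrated with a small, network-free sketch. Everything below is an assumption made for illustration: the example.org URIs, the lookup tables, and the function names are hypothetical and do not come from the talk or from the ORE specifications.

```python
# Hypothetical URIs: example.org is a placeholder, not taken from the slides.
# Maps the URI of a non-information resource to the document that describes it.
DESCRIBED_BY = {
    "http://example.org/id/experiment-1": "http://example.org/doc/experiment-1",
}

# Information resources that can be returned directly.
INFORMATION_RESOURCES = {
    "http://example.org/doc/experiment-1": "RDF description of experiment-1",
}


def respond(uri):
    """Return (status, headers, body) roughly as a Linked Data server might."""
    if uri in DESCRIBED_BY:
        # A thing that cannot be sent over the wire: redirect to its description.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    if uri in INFORMATION_RESOURCES:
        return 200, {}, INFORMATION_RESOURCES[uri]
    return 404, {}, ""


def dereference(uri, max_hops=5):
    """Follow 303 redirects until a final response is reached."""
    for _ in range(max_hops):
        status, headers, body = respond(uri)
        if status == 303:
            uri = headers["Location"]
            continue
        return status, uri, body
    raise RuntimeError("too many redirects")
```

Calling `dereference("http://example.org/id/experiment-1")` ends at the describing document rather than at the thing itself, which is the distinction the slide draws between the "From" and "To" URIs.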
  20. 20. Adding Account of Research Life Cycles to Scholarly Record
       Pepe, A., Mayernik, M., Borgman, C., Van de Sompel, H. (2009) Technology to Represent Scientific Practice: Data, Life Cycles, and Value Chains.
  21. 21. ORE & Research Objects
       “…, Research Objects should also be able to aggregate literal data as well as references to data.”
       • Aggregated Resources in ORE have HTTP URIs; probably needs to be relaxed.
       • Embedding content in RDF, irrespective of ORE, is … interesting
           • See: Representing Content in RDF 1.0
           • Allows embedding base64, text, XML
       • Resource Map as manifest in e.g. ZIP file?
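The closing question on this slide (a Resource Map acting as a manifest inside a ZIP file) can be sketched as follows. This is only an illustration under stated assumptions: the file name `manifest.json`, the `aggregates` key, and the JSON layout are hypothetical stand-ins. ORE defines its own RDF-based Resource Map serializations, which this sketch does not attempt to reproduce.

```python
import io
import json
import zipfile


def pack_research_object(files, manifest_name="manifest.json"):
    """Bundle files plus a manifest listing them into an in-memory ZIP.

    `files` maps archive paths to text content. The manifest format is a
    hypothetical simplification, not an ORE Resource Map serialization.
    """
    manifest = {"aggregates": sorted(files)}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
        archive.writestr(manifest_name, json.dumps(manifest, indent=2))
    return buffer.getvalue()


def read_manifest(zip_bytes, manifest_name="manifest.json"):
    """Read the manifest back without unpacking the aggregated files."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return json.loads(archive.read(manifest_name))


bundle = pack_research_object(
    {"data/results.csv": "a,b\n1,2\n", "workflow/run.txt": "step 1\n"}
)
```

A consumer can call `read_manifest(bundle)` to learn what the package aggregates before touching any of the content, which is the role the slide suggests for the Resource Map.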
  22. 22. Research Objects
  23. 23. Research Objects: Annotation
       “Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object.”
  24. 24.
       • Annotation is a pervasive scholarly activity, conducted by people and machines
       • Many annotation efforts and tools
       • But annotations stuck in silos:
           • Only consumable by client that created it
           • Annotations not shareable beyond original environment
       • Open Annotation focuses on interoperability for annotations in order to allow sharing of annotations across:
           • Annotation clients
           • Content collections
           • Services that leverage annotations
       (2009; funded by the Mellon Foundation)
  25. 25. W3C Open Annotation Community Group
       • Established to reconcile Open Annotation Collaboration and Annotation Ontology models
       • 67 participants from around the world: 7th of 119 groups. Many universities, also commercial and not-for-profit
       • Mission: Interoperability between Annotation systems and platforms, by
           • …following the Architecture of the Web
           • …reusing existing web standards
           • …providing a single, coherent model to implement
           • …without requiring adoption of specific platforms
           • …while maintaining low implementation costs
  26. 26. What is an Annotation?
       “An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is related to the target.”
       • Highlighting, Bookmarking … Provide an Aide-Memoire
       • Commenting, Describing … Share and Inform
       • Tagging, Linking … Improve Discovery
       • Classifying, Identifying … Organize Resources
       • Questioning, Replying … Interact with Others
       • Editing, Moderating … Create as well as Consume
  27. 27. Annotates  Annotations
  28. 28. Annotates?  Annotations?
  29. 29. Basic Open Annotation Data Model
  30. 30. Use Case: Bookmarking
  31. 31. Use Case: Commenting
  32. 32. Use Case: Commenting
  33. 33. Use Case: Tagging
  34. 34. Further Specification of Resources
       Specific Body and Specific Target resources identify the region of interest, and/or the state of the resource.
       Need to be able to describe the state of the resource, the segment of interest, and potentially styling hints for how to render it.
       Open Annotation introduces:
       • State: Describes how to retrieve representation
       • Selector: Describes how to select segment
       • Style: Describes how to render/process segment
       • Scope: Describes context of the resource
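A minimal sketch of the body/target model with an optional Selector, written as plain Python dictionaries. The term names (`oa:hasBody`, `oa:hasTarget`, `oa:SpecificResource`, `oa:hasSource`, `oa:hasSelector`, `oa:FragmentSelector`) follow the Open Annotation community drafts as best recalled and should be checked against the specification. The dictionary layout, the function names, and the example.org URIs are assumptions made for illustration only.

```python
def make_annotation(body_uri, source_uri, fragment=None):
    """Build an annotation relating a body to a target.

    Without `fragment`, the target is the whole resource. With `fragment`,
    the target becomes a specific resource: the source plus a selector
    that picks out the segment of interest.
    """
    target = source_uri
    if fragment is not None:
        target = {
            "@type": "oa:SpecificResource",
            "oa:hasSource": source_uri,
            "oa:hasSelector": {
                "@type": "oa:FragmentSelector",
                "rdf:value": fragment,
            },
        }
    return {
        "@type": "oa:Annotation",
        "oa:hasBody": body_uri,
        "oa:hasTarget": target,
    }


def target_source(annotation):
    """Return the URI of the annotated resource, with or without a selector."""
    target = annotation["oa:hasTarget"]
    return target["oa:hasSource"] if isinstance(target, dict) else target


# Hypothetical URIs, for illustration only.
comment = make_annotation(
    "http://example.org/comments/1",
    "http://example.org/paper.html",
    fragment="section-3",
)
```

State, Style, and Scope would hang off the specific resource in the same way as the Selector does here. They are left out to keep the sketch short.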
  35. 35. Use Case: Changing Content at the Same URI
  36. 36. Use Case: Segment of Interest
  37. 37. W3C Open Annotation & Research Objects
       • Early renderings of Research Objects emerging from the Wf4Ever project use Annotation Ontology as the annotation framework
       • But since the Annotation Ontology and Open Annotation Collaboration models now merge into the W3C Open Annotation model, it is safe to assume W3C Open Annotation will be used for Research Objects
  38. 38. Research Objects
  39. 39. Research Objects: Versioning and Evolution
       “Research Objects are dynamic in that their contents can change and be changed – additional contents may be added to aggregations, or additional metadata can be asserted about the contents or relationships between content. The resources that are aggregated may change. Thus there is a need for versioning, allowing the recording of changes to objects, potentially along with facilities for retrieving objects or aggregated elements at particular historical points in their lifecycle.”
       Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge
  40. 40. ORE Experiment: Versioning and Evolution of Compound Objects
       Van de Sompel, H. et al. (2007) Appendix to Interoperability for the Discovery, Use, and Re-Use of Units of Scholarly Communication
  41. 41.
       • Memento is about the Web and time:
           • Resources evolve over time
           • Only the current representation is available from a resource’s URI
           • How to seamlessly access prior representations, if they exist?
       • Memento looks at this problem for the Web, in general
       (Digital Preservation Award 2010. 2009; funded by the Library of Congress)
  42. 42. URI for Original, URI for Version (Web Archive): URI-M - http://  URI-R - http://
  43. 43. URI for Original, URI for Version (CMS): URI-M - http:// … title=September_11_attacks&oldid=282333  URI-R - http://
  44. 44. Time Travel for the Web: Demo
  45. 45. Memento & Research Objects
       The combination of:
       • Pro-active archiving of Research Objects and their constituent resources, using
           • Web archiving techniques, e.g. crawling, transactional archiving
           • Platforms with strong versioning capabilities, e.g. datawikis, GitHub
       • Assigning URIs to Research Objects and their constituent resources according to the well-established time-generic (URI-R) and time-specific (URI-M) resource pattern
       • The Memento protocol to access time-specific versions of Research Objects and their constituent resources via their time-generic URI and timestamp
       makes a good candidate for addressing the versioning and evolution need.
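Access "via time-generic URI and timestamp" can be sketched without any network calls. The `Accept-Datetime` request header is the one the Memento protocol uses for datetime negotiation. Everything else here is an assumption for illustration: the function names, the in-memory list of versions, and the example.org URIs are hypothetical, and in the real protocol the choice of version is made by the server side rather than by the client.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_headers(when):
    """Build request headers asking for the version that was live at `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT,
    the form HTTP date headers use.
    """
    gmt = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(gmt, usegmt=True)}


def closest_memento(mementos, when):
    """Pick the URI-M whose archival datetime is nearest to `when`.

    `mementos` is a list of (datetime, uri_m) pairs. This mimics, in a
    simplified way, the selection a Memento-aware service performs.
    """
    return min(mementos, key=lambda pair: abs(pair[0] - when))[1]


# Hypothetical time-specific versions (URI-M) of one time-generic URI-R.
VERSIONS = [
    (datetime(2009, 1, 1, tzinfo=timezone.utc), "http://example.org/ro/1?v=2009"),
    (datetime(2011, 6, 1, tzinfo=timezone.utc), "http://example.org/ro/1?v=2011"),
    (datetime(2013, 1, 1, tzinfo=timezone.utc), "http://example.org/ro/1?v=2013"),
]

wanted = datetime(2011, 1, 1, tzinfo=timezone.utc)
headers = timegate_headers(wanted)
chosen = closest_memento(VERSIONS, wanted)
```

The client only ever needs the time-generic URI and a datetime. Which time-specific URI it ends up at is resolved for it, which is what makes the pattern attractive for Research Objects whose constituent resources change independently.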
  46. 46. Research Objects
  47. 47. Research Objects: Provenance
       “The issue of provenance, and being able to audit experiments and investigations is key to the scientific method. Third parties must be able to audit the steps performed in an experiment in order to be convinced of the validity of results. Audit is required not just for regulatory purposes, but allows for the results of experiments to be interpreted and reused, thus a Research Object should provide sufficient information to support audit of the aggregation as a whole, its constituent parts, and any process that it may encapsulate.”
       Bechhofer S. et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge
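The audit requirement in this quote can be illustrated by walking provenance statements backwards from a result. The sketch below borrows two relation names from W3C PROV (`prov:wasGeneratedBy`, `prov:used`). The resource names, the flat triple list, and the `audit_trail` function are hypothetical and are not how any particular Research Object implementation stores provenance.

```python
# Hypothetical provenance statements as (subject, relation, object) triples.
STATEMENTS = [
    ("figure.png", "prov:wasGeneratedBy", "plot-run"),
    ("plot-run", "prov:used", "results.csv"),
    ("results.csv", "prov:wasGeneratedBy", "simulation-run"),
    ("simulation-run", "prov:used", "config.yaml"),
]


def audit_trail(resource, statements):
    """Collect every statement reachable by following provenance backwards."""
    trail = []
    frontier = [resource]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue  # guard against cycles in malformed provenance
        seen.add(current)
        for subject, relation, obj in statements:
            if subject == current:
                trail.append((subject, relation, obj))
                frontier.append(obj)
    return trail
```

`audit_trail("figure.png", STATEMENTS)` returns the chain of steps and inputs behind the figure, which is the kind of information a third party would need in order to audit "the aggregation as a whole, its constituent parts, and any process that it may encapsulate".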
  48. 48. Van de Sompel, H. (2003) Roadblocks
  49. 49. Moreau, L. et al. (2010) The Open Provenance Model: Abstract Model Provenance Model
  50. 50. W3C Provenance
  51. 51. Research Objects  PROV  
  52. 52. The Clean Slate Challenge
  53. 53.
       • ResourceSync is about synchronization of web resources, things with a URI that can be dereferenced
           • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
           • Low change frequency (weeks/months) to high change frequency (seconds)
           • Synchronization latency and accuracy needs may vary
       • Modular framework based on Sitemaps and extensions
       (2012; funded by the Sloan Foundation)
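Since the slide says the framework builds on Sitemaps, the basic idea can be sketched with a plain Sitemap document and no ResourceSync extensions. The sample document, the example.org URIs, and the `changed_since` function are assumptions for illustration. The actual ResourceSync documents add extension elements that are not shown here.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace of the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical source description: two resources with last-modified dates.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/res1</loc><lastmod>2013-01-10</lastmod></url>
  <url><loc>http://example.org/res2</loc><lastmod>2013-05-20</lastmod></url>
</urlset>"""


def changed_since(sitemap_xml, last_sync):
    """Return the URIs whose lastmod date is later than `last_sync`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # Only date-only lastmod values (YYYY-MM-DD) are handled in this sketch.
        if lastmod and date.fromisoformat(lastmod) > last_sync:
            changed.append(loc)
    return changed
```

A destination that last synchronized on 1 March 2013 would call `changed_since(SAMPLE, date(2013, 3, 1))` and fetch only the resources it returns. How often to poll is where the latency and accuracy trade-offs on the slide come in.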
  54. 54. hiberlink
       • Investigates reference rot at massive scale:
           • Citation rot - Do HTTP references in scholarly articles still resolve?
           • Content rot - If so, is the content at the end of the HTTP reference still representative of the content that was originally referenced?
       • Investigates pro-active ways to archive HTTP referenced resources that occur in scholarly articles
       (2013; funded by the Mellon Foundation. Soon at)
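The two kinds of rot distinguished on this slide can be expressed as a small decision function. This is a deliberately crude sketch under stated assumptions: it treats any byte-level change as content rot by comparing SHA-256 fingerprints, whereas the slide asks whether content is still "representative", which would call for a softer similarity measure. The function names are hypothetical and do not describe the project's method.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of the referenced content (bytes)."""
    return hashlib.sha256(content).hexdigest()


def classify_reference(status_code, original_fingerprint, current_content):
    """Classify an HTTP reference found in a scholarly article.

    - "citation rot": the reference no longer resolves
    - "content rot": it resolves, but the content differs from what was cited
    - "intact": it resolves and the content is unchanged
    """
    if status_code != 200 or current_content is None:
        return "citation rot"
    if fingerprint(current_content) != original_fingerprint:
        return "content rot"
    return "intact"


# Fingerprint recorded at citation time (hypothetical content).
cited = fingerprint(b"dataset v1")
```

Recording a fingerprint, or better an archived copy, at the moment of citation is what makes the second check possible at all, which ties in with the pro-active archiving mentioned on the slide.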
  55. 55. Research Objects
  56. 56.
  57. 57. A Clean Slate?
       @hvdsomp van de sompel
       Includes slides by Sean Bechhofer, Carole Goble, Robert Sanderson