Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
The document discusses the ISA infrastructure, which provides a framework for tracking metadata in bioscience experiments from data collection to sharing in linked data clouds. The infrastructure includes a metadata syntax, open source software tools, and a user community. It allows annotation of experimental metadata, materials, and processes using ontologies to make semantics explicit and enable integration and knowledge discovery. The infrastructure is growing with over 30 public and private resources adopting it to facilitate standards-compliant sharing of investigations across life science domains.
re3data - Registry of Research Data Repositories | Heinz Pampel
The document discusses re3data, a global registry of research data repositories. It provides background on funder and publisher data policies driving the need for data sharing. The registry aims to help researchers, funders and others find appropriate data repositories across disciplines. It covers over 1,300 repositories described using a metadata schema. Analysis of the registered repositories shows most are disciplinary focused and contain environmental and geoscience data which is mostly openly accessible. The registry is governed through an international board and technical developments include open APIs and integration with DataCite.
Bio-GraphIIn is a graph-based, integrative and semantically enabled repository for life science experimental data. It addresses the need for a system that supports retrospective data submissions, handles heterogeneous experimental data, and overcomes the fragmentation of existing data formats and databases. Bio-GraphIIn uses the Investigation/Study/Assay (ISA) framework and ontologies to semantically represent experimental metadata and enable rich queries across studies, with the goal of facilitating integrative data analysis.
This document discusses improving reproducibility and transparency in scholarly publishing by rewarding open sharing of data, methods, and results. It presents a case study using the ISA framework to describe experimental data and workflows, nanopublications to make assertions from results, and research objects to aggregate these scholarly artifacts. Combining these approaches could help address issues like increasing retractions by incentivizing open review, replication of analyses, and credit for sharing diverse scholarly contributions beyond publications. The document advocates extending this case study to provide community guidelines for integrated use of these models.
BioAssay Research Database - Presentation at the ChemAxon UGM 2013 | Andrea de Souza
The BARD platform aims to enable researchers to effectively use data from the Molecular Libraries Program (MLP) to generate new hypotheses. It provides tools for assay registration, querying and visualization of over 4,000 assays, 35 million compounds and 300 projects. These include intuitive guided queries, cross-assay views, and predictive models. The platform structures data using an Assay Definition Standard to integrate ontologies and represent assay metadata, results and experiments. This supports knowledge discovery and hypothesis generation from diverse chemical biology datasets.
The document provides an overview of the Donders Repository, which aims to securely store original research data, document the research process, and make data accessible to researchers and the public. It describes the procedural design, including the different roles, collection types, and states. The technical architecture is based on iRODS software and scalable storage. The repository fits into researchers' workflows and supports the timeline of projects from initiation to data sharing. Standards like BIDS help make neuroimaging data FAIR (Findable, Accessible, Interoperable, Reusable).
Written and presented by Carole Goble (University of Manchester) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
This document summarizes a presentation about publishing research data with Scientific Data. It discusses the benefits of sharing research data, including generating more analyses and reuse. It outlines Scientific Data's process for publishing Data Descriptors, which include both human-readable articles and machine-readable metadata. Data Descriptors can be published at any point in the research process. The presentation notes that Data Descriptors provide credit for data generators, enable discovery and reuse of data, and have resulted in data being cited and reused in different fields and by the public.
The document discusses the ISA infrastructure, which provides a standardized format (ISA-TAB) for experimental metadata and data exchange. It can be used across various domains like toxicology, systems biology, and nanotechnology. The Risa R package integrates experimental metadata with analysis and allows updating metadata. Nature Scientific Data is a new publication for describing valuable datasets. The ISA framework has been adopted by over 30 public and private resources and is growing in use for facilitating reuse of investigations in various life science domains. Toxicity examples include EU projects on predictive toxicology and a rat study of drug candidates. Questions can be directed to the ISA tools group.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges from the perspective of collaborative sharing and reuse of a broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied, practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Citing data in research articles: principles, implementation, challenges - an... | FAIRDOM
Prepared and presented by Jo McEntyre (EMBL-EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Using Open Science to advance science - advancing open data | Robert Oostenveld
This document discusses using open science practices like open data to advance science. It notes the benefits of open data like improved reproducibility and opportunities for data mining. However, sharing neuroimaging and other human subject data presents challenges regarding data size, sensitivity, and privacy regulations. The document promotes using the Brain Imaging Data Structure (BIDS) format to organize data in an open, standardized way. It also discusses the gradient between personal/identifiable data that requires protection and de-identified research data that can be shared, as well as legal constraints and appropriate repositories for sharing data responsibly.
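As a minimal illustration of the BIDS convention promoted above, the sketch below builds BIDS-style relative paths (subject directory, datatype directory, and an entity-based filename with `sub-` and `task-` labels). It covers only a tiny subset of the specification and is no substitute for the BIDS validator; the helper name and parameters are our own.

```python
from pathlib import PurePosixPath

def bids_path(subject, datatype, suffix, task=None, ext=".nii.gz"):
    """Build a BIDS-style relative path, e.g.
    sub-01/func/sub-01_task-rest_bold.nii.gz."""
    entities = [f"sub-{subject}"]
    if task is not None:
        entities.append(f"task-{task}")
    entities.append(suffix)  # BIDS suffix, e.g. T1w or bold
    filename = "_".join(entities) + ext
    return str(PurePosixPath(f"sub-{subject}", datatype, filename))

# An anatomical scan and a resting-state functional run for subject 01
anat = bids_path("01", "anat", "T1w")
func = bids_path("01", "func", "bold", task="rest")
```

Organizing files this way is what makes the shared data machine-discoverable: tools can locate every subject's T1w image without dataset-specific configuration.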
Data collection is the process of systematically gathering information to answer research questions. Accurate data collection is essential to maintaining research integrity. Issues that can compromise integrity include errors in data collection instruments or procedures. Quality assurance and quality control help ensure integrity. Quality assurance occurs before data collection through standardized protocols and manuals. Quality control occurs during and after collection through review and validation of data. Maintaining integrity supports accurate conclusions and prevents wasted resources.
On community-standards, data curation and scholarly communication - BITS, Ita... | Susanna-Assunta Sansone
The document discusses the vision of a "connected digital research enterprise" where researchers can more easily find and collaborate with others based on shared data and outputs. It describes a scenario where Researcher X discovers commonalities in data with Researcher Y, views Y's datasets and publications, and initiates a collaboration. Their joint work is captured and indexed, and a company utilizes some of the outputs while providing funding back to the researchers. The vision aims to more closely connect scientific work through shared digital resources.
This document presents a proposal for an individual project on opportunistic persistent data storage. The proposal discusses opportunistic networks and their social network properties. It aims to efficiently implement persistent data storage in opportunistic networks by utilizing social network properties. Key research questions are how to implement efficient storage protocols using social properties and handling replicas with low overhead. The study will select server nodes based on social properties and implement a storage model with write and read quorums. Evaluation will use the ONE simulator and synthetic traces to analyze the approach.
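The write and read quorums mentioned in the proposal can be sketched generically. The proposal's actual protocol and its social-property-based node selection are not detailed in the summary, so the replica layout and the `read_latest` helper below are illustrative assumptions, not the study's implementation.

```python
def quorums_consistent(n, w, r):
    """With n replicas, a read is guaranteed to intersect the latest
    write when the quorum sizes satisfy w + r > n."""
    return w + r > n

def read_latest(replicas, r):
    """Read from r replicas and return the value with the highest version.

    In practice the r nodes would be reachable server nodes chosen for
    their social properties; here we simply take the first r."""
    sample = replicas[:r]
    return max(sample, key=lambda rep: rep["version"])["value"]

# Hypothetical replica set of size 3, one copy stale
replicas = [
    {"version": 2, "value": "new"},
    {"version": 1, "value": "old"},
    {"version": 2, "value": "new"},
]
```

The overlap condition is why quorum replication tolerates stale copies: any read quorum of size r must share at least one node with the last write quorum of size w.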
BioSharing.org - mapping the landscape of community standards, databases, dat... | Alejandra Gonzalez-Beltran
This document summarizes Alejandra González-Beltrán's presentation on BioSharing.org, which maps the landscape of community standards, databases, and data policies. It discusses how BioSharing aims to help stakeholders make informed decisions for data interoperability by curating crowdsourced information on existing standards and policies. It also describes how BioSharing integrates information from the MIBBI Project and is working with the MICheckout tool to help users create and use modular standards components.
IDCC Workshop: Analysing DMPs to inform research data services: lessons from ... | Amanda Whitmire
A workshop as part of the International Digital Curation Conference 2016 on DMP development and support. This presentation demonstrates how we can use data management plans as a source of information to better understand researcher data stewardship practices and how to support them. Be sure to see the slide notes to better understand the presentation (most slides are just photos/icons).
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1 | Bruce Kozuma
The document discusses establishing a common cell line metadata registry at the Broad Institute to facilitate collaboration. It proposes using an institutional database as the canonical source, and ingesting data into local systems to link project-specific information to parental cell line data. This would create a shared registry of parental cell lines available to all groups, along with project-specific daughter cell lines. The goals are to standardize metadata, enable discovery of related work, and accelerate research progress.
re3data.org – Registry of Research Data Repositories | Heinz Pampel
Heinz Pampel | GFZ German Research Centre for Geosciences, LIS
Maxi Kindling | Humboldt-Universität zu Berlin, Berlin School of Library and Information Science
Frank Scholze | Karlsruhe Institute of Technology, KIT Library
RDA-Deutschland-Treffen 2015 | Potsdam, November 26, 2015
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely promoted as the basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover, the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository is highly desirable to support NGS-related research. We have selected iRODS (integrated Rule-Oriented Data System) [3] as the basis for implementing a sequencing data repository because it allows storing data and metadata together. iRODS serves as scalable middleware that provides centralized, virtualized access to different storage facilities and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS and to manage a triplestore for linked data. The metadata in the iCAT (iRODS' metadata catalogue) and the ontology in Virtuoso are kept synchronized by enforcing strict data manipulation policies. We have implemented a prototype to preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes: Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal investigator, data curator, repository administrator), each with different access rights. New data is ingested by copying raw sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. The sample sheet triggers an iRODS rule that extracts the metadata and registers it in the iCAT as AVUs (Attribute, Value, Unit). Ontology files are registered in Virtuoso. The sequence files are copied to the persistent collection and made uniquely identifiable based on metadata. All steps are recorded in a report file that enables monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype, and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the researchers' workflow well.
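In the prototype, metadata extraction runs server-side as an iRODS rule; as a rough client-side sketch of just that step, the following parses a sample sheet (CSV) into the kind of (attribute, value, unit) triples registered in the iCAT. The column names are invented for illustration and the unit field is left empty, as a real sample sheet and rule may differ.

```python
import csv
import io

def sample_sheet_to_avus(sheet_text, unit=""):
    """Parse a CSV sample sheet into iRODS-style AVU triples.

    Each row yields one (attribute, value, unit) triple per column,
    keyed by the sample identifier in the first column."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    id_column = reader.fieldnames[0]
    avus = {}
    for row in reader:
        avus[row[id_column]] = [
            (attr, value, unit)
            for attr, value in row.items()
            if attr != id_column
        ]
    return avus

# Hypothetical two-sample sheet
sheet = """Sample_ID,Organism,Read_Length
S1,Homo sapiens,150
S2,Mus musculus,100
"""
avus = sample_sheet_to_avus(sheet)
```

In the actual repository these triples would be attached to the corresponding data objects in the iCAT, making the sequence files queryable by their metadata.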
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
Interlinking educational data to Web of Data (Thesis presentation) | Enayat Rajabi
This is a thesis presentation about interlinking educational data to the Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud.
Engaging Information Professionals in the Process of Authoritative Interlinki... | Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
The document discusses using metadata to find researchers within and across organizations. It provides an example of analyzing data from the CiNii and KAKEN databases to find collaborators of researchers at the National Institute of Informatics in Japan. Network analysis was performed and revealed 61 researchers with 1,832 collaborators based on CiNii data and 37 researchers with 421 collaborators based on KAKEN data. The analysis also examined collaboration networks within the Graduate University for Advanced Studies, which includes researchers from diverse domains across its 21 departments. The document emphasizes that while the data provides opportunities to explore collaboration, making services to easily support researchers remains important.
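Collaborator counts like those above are typically derived from a co-authorship graph built from bibliographic records. A minimal sketch, assuming hypothetical per-paper author lists rather than real CiNii or KAKEN data:

```python
from collections import defaultdict
from itertools import combinations

def collaborators(papers):
    """Build an undirected co-authorship graph: each paper's author
    list contributes an edge between every pair of its authors."""
    graph = defaultdict(set)
    for authors in papers:
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical publication records: each inner list is one paper's authors
papers = [["A", "B"], ["A", "C", "D"], ["B", "D"]]
g = collaborators(papers)
```

A researcher's collaborator count is then simply the size of their neighbor set, and cross-database comparisons (as with CiNii versus KAKEN) amount to building one graph per source.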
Impact of climate change policy on the National Electricity Market | Engineers Australia
The document discusses the changing electricity system in Australia and the challenges this poses for power system operations. Major drivers of change include policies around renewable energy targets and carbon pricing. This will result in new types of power plants, possible early retirements of coal plants, and changes to demand patterns. The operator, AEMO, is enhancing its operations planning process to look ahead two years and consider scenarios around potential changes in generation and demand. This aims to help AEMO identify trends and ensure operational practices can effectively manage the changing power system.
The document provides an overview of the Donders Repository, which aims to securely store original research data, document the research process, and make data accessible to researchers and the public. It describes the procedural design including different roles, collection types, and states. The technical architecture is based on IRODS software and scalable storage. The repository fits into researchers' workflows and supports the timeline of projects from initiation to data sharing. Standards like BIDS help make neuroimaging data FAIR (Findable, Accessible, Interoperable, Reusable).
Written and presented by Carole Goble (University of Manchester) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
This document summarizes a presentation about publishing research data with Scientific Data. It discusses the benefits of sharing research data, including generating more analyses and reuse. It outlines Scientific Data's process for publishing Data Descriptors, which include both human-readable articles and machine-readable metadata. Data Descriptors can be published at any point in the research process. The presentation notes that Data Descriptors provide credit for data generators, enable discovery and reuse of data, and have resulted in data being cited and reused in different fields and by the public.
The document discusses the ISA infrastructure, which provides a standardized format (ISA-TAB) for experimental metadata and data exchange. It can be used across various domains like toxicology, systems biology, and nanotechnology. The Risa R package integrates experimental metadata with analysis and allows updating metadata. Nature Scientific Data is a new publication for describing valuable datasets. The ISA framework has been adopted by over 30 public and private resources and is growing in use for facilitating reuse of investigations in various life science domains. Toxicity examples include EU projects on predictive toxicology and a rat study of drug candidates. Questions can be directed to the ISA tools group.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume related challenges, including the use of high performance computing. In this talk, we will mainly focus on other challenges from the perspective of collaborative sharing and reuse of broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration and discovery capabilities. We will borrow examples of tools and capabilities from state of the art work in supporting physicists (including astrophysicists) [1], life sciences [2], material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice oriented talk will complement more vision oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Citing data in research articles: principles, implementation, challenges - an...FAIRDOM
Prepared and presented by Jo McEntyre (EMBL_EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Using Open Science to advance science - advancing open data Robert Oostenveld
This document discusses using open science practices like open data to advance science. It notes the benefits of open data like improved reproducibility and opportunities for data mining. However, sharing neuroimaging and other human subject data presents challenges regarding data size, sensitivity, and privacy regulations. The document promotes using the Brain Imaging Data Structure (BIDS) format to organize data in an open, standardized way. It also discusses the gradient between personal/identifiable data that requires protection and de-identified research data that can be shared, as well as legal constraints and appropriate repositories for sharing data responsibly.
Data collection is the process of systematically gathering information to answer research questions. Accurate data collection is essential to maintaining research integrity. Issues that can compromise integrity include errors in data collection instruments or procedures. Quality assurance and quality control help ensure integrity. Quality assurance occurs before data collection through standardized protocols and manuals. Quality control occurs during and after collection through review and validation of data. Maintaining integrity supports accurate conclusions and prevents wasted resources.
On community-standards, data curation and scholarly communication - BITS, Ita...Susanna-Assunta Sansone
The document discusses the vision of a "connected digital research enterprise" where researchers can more easily find and collaborate with others based on shared data and outputs. It describes a scenario where Researcher X discovers commonalities in data with Researcher Y, views Y's datasets and publications, and initiates a collaboration. Their joint work is captured and indexed, and a company utilizes some of the outputs while providing funding back to the researchers. The vision aims to more closely connect scientific work through shared digital resources.
This document presents a proposal for an individual project on opportunistic persistent data storage. The proposal discusses opportunistic networks and their social network properties. It aims to efficiently implement persistent data storage in opportunistic networks by utilizing social network properties. Key research questions are how to implement efficient storage protocols using social properties and handling replicas with low overhead. The study will select server nodes based on social properties and implement a storage model with write and read quorums. Evaluation will use the ONE simulator and synthetic traces to analyze the approach.
BioSharing.org - mapping the landscape of community standards, databases, dat...Alejandra Gonzalez-Beltran
This document summarizes Alejandra González-Beltrán's presentation on BioSharing.org, which maps the landscape of community standards, databases, and data policies. It discusses how BioSharing aims to help stakeholders make informed decisions for data interoperability through curating crowdsourcing information on existing standards and policies. It also describes how BioSharing integrates information from the MIBBI Project and is working with the MICheckout tool to help users create and use modular standards components.
IDCC Workshop: Analysing DMPs to inform research data services: lessons from ...Amanda Whitmire
A workshop as part of the International Digital Curation Conference 2016 on DMP development and support. This presentation demonstrates how we can use data management plans as a source of information to better understand researcher data stewardship practices and how to support them. Be sure to see the slide notes to better understand the presentation (most slides are just photos/icons).
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1Bruce Kozuma
The document discusses establishing a common cell line metadata registry at the Broad Institute to facilitate collaboration. It proposes using an institutional database as the canonical source, and ingesting data into local systems to link project-specific information to parental cell line data. This would create a shared registry of parental cell lines available to all groups, along with project-specific daughter cell lines. The goals are to standardize metadata, enable discovery of related work, and accelerate research progress.
re3data.org – Registry of Research Data RepositoriesHeinz Pampel
Heinz Pampel | GFZ German Research Centre for Geosciences, LIS
Maxi Kindling | Humboldt-Universität zu Berlin, Berlin School of Library and Information Science Frank Scholze | Karlsruhe Institute of Technology, KIT Library
RDA-Deutschland-Treffen 2015| Potsdam, November 26, 2015
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation
sequencing (NGS) still lack adequate RDM solutions. The NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS-related research. We have selected iRODS (integrated Rule-Oriented Data
System) [3] as the basis for implementing a sequencing data repository because it allows storing both data and metadata
together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized
way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS to enable the management
of a triplestore for linked data. The metadata in the iCAT (the iRODS metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcement of strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion, data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal
investigator, data curator, repository administrator), with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS
rule is triggered by the sample sheet file; the rule extracts the metadata and registers it in the iCAT as AVUs (Attribute,
Value and Unit). Ontology files are registered into Virtuoso. The sequence files are copied to the persistent collection
and are made uniquely identifiable based on metadata. All the steps are recorded into a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers' workflow well.
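The ingestion step described above (an iRODS rule extracting sample-sheet metadata and registering it as Attribute, Value, Unit triples) can be sketched in plain Python. This is a minimal sketch assuming a simple CSV sample sheet with a Sample_ID column; the column names and the AVU naming scheme are illustrative, not the prototype's actual format.

```python
import csv
import io

def sample_sheet_to_avus(sheet_text, unit="NGS"):
    """Turn a CSV sample sheet into (Attribute, Value, Unit) triples,
    mirroring how an ingestion rule might register metadata in the iCAT."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    avus = []
    for row in reader:
        sample = row.get("Sample_ID", "unknown")
        for attribute, value in row.items():
            if attribute == "Sample_ID" or not value:
                continue
            # Prefix attributes with the sample ID to keep AVUs unambiguous
            avus.append((f"{sample}.{attribute}", value, unit))
    return avus

# Hypothetical sample sheet content, for illustration only
sheet = """Sample_ID,Organism,Library_Prep
S001,Homo sapiens,TruSeq
S002,Mus musculus,Nextera
"""
for avu in sample_sheet_to_avus(sheet):
    print(avu)
```

In the real system the rule engine would fire on file arrival and write the AVUs into the iCAT; here the extraction logic alone is shown.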
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
Interlinking educational data to Web of Data (Thesis presentation) – Enayat Rajabi
This is a thesis presentation about interlinking educational data to Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud
Engaging Information Professionals in the Process of Authoritative Interlinki... – Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
The document discusses using metadata to find researchers within and across organizations. It provides an example of analyzing data from the CiNii and KAKEN databases to find collaborators of researchers at the National Institute of Informatics in Japan. Network analysis was performed and revealed 61 researchers with 1,832 collaborators based on CiNii data and 37 researchers with 421 collaborators based on KAKEN data. The analysis also examined collaboration networks within the Graduate University for Advanced Studies, which includes researchers from diverse domains across its 21 departments. The document emphasizes that while the data provides opportunities to explore collaboration, making services to easily support researchers remains important.
Impact of climate change policy on the National Electricity Market – Engineers Australia
The document discusses the changing electricity system in Australia and the challenges this poses for power system operations. Major drivers of change include policies around renewable energy targets and carbon pricing. This will result in new types of power plants, possible early retirements of coal plants, and changes to demand patterns. The operator, AEMO, is enhancing its operations planning process to look ahead two years and consider scenarios around potential changes in generation and demand. This aims to help AEMO identify trends and ensure operational practices can effectively manage the changing power system.
Sharon Dawes (CTG Albany) Open data quality: a practical view – Open City Foundation
This document discusses open data quality and focuses on ensuring data is fit for its intended use. It notes that while open data aims to provide easy access, the value depends on the quality and how users apply the data. Quality issues can arise from how data is originally collected and maintained by different government systems. The document recommends open data providers adopt stewardship practices to maintain metadata and ensure quality, while users should approach data cautiously and look for ways to engage in data communities. Overall it promotes openness but also a realistic view of potential quality problems and the need for tools and strategies to maximize data value for various users.
This document discusses getting organizations and websites on the Linked Data web by following Linked Data principles. It provides an overview of Linked Data and its growth over time. The key Linked Data principles are to publish semantic data using RDF, enable linking between data through URIs, and use real URIs for identifying things. Adopting these principles allows data integration and querying across diverse datasets through standards like SPARQL. The document also discusses challenges in applying Linked Data to existing web content and standards like RDFa that embed semantic metadata directly in web pages.
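The core Linked Data idea described above, that shared URIs let queries span otherwise separate datasets, can be made concrete with a toy example. The following Python sketch stands in for RDF and SPARQL; the URIs and predicates are invented for illustration, and `match` plays the role of a basic graph pattern.

```python
# Toy triple store: two datasets share a URI, so one query spans both.
# All URIs and predicate names below are illustrative, not a real vocabulary.

dataset_a = [
    ("http://example.org/person/ada", "name", "Ada Lovelace"),
    ("http://example.org/person/ada", "bornIn", "http://example.org/place/london"),
]
dataset_b = [
    ("http://example.org/place/london", "name", "London"),
    ("http://example.org/place/london", "country", "England"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard,
    like a variable in a SPARQL basic graph pattern."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

merged = dataset_a + dataset_b
# Follow the shared URI: where was Ada born, and what is that place called?
(_, _, place), = match(merged, s="http://example.org/person/ada", p="bornIn")
(_, _, place_name), = match(merged, s=place, p="name")
print(place_name)
```

The second lookup succeeds only because both datasets use the same URI for London; that is exactly the integration benefit the Linked Data principles aim for.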
WWW2014 Overview of W3C Linked Data Platform 20140410 – Arnaud Le Hors
The document summarizes the Linked Data Platform (LDP) being developed by the W3C Linked Data Platform Working Group. It describes the challenges of using Linked Data for application integration today and how the LDP specification aims to address these by defining HTTP-based patterns for creating, reading, updating and deleting Linked Data resources and containers in a standardized, RESTful way. The LDP models resources as HTTP entities that can be manipulated via standard methods and represent their state using RDF, addressing questions around resource management that the original Linked Data principles did not.
What is hot on the web right now - A W3C perspective – Armin Haller
HTTP, HTML and the Web itself are entering their third decade of existence. Still, the Web continues to transform human communication, information sharing, commerce, education, and entertainment. Social networking, cloud computing, and the convergence of Web, television, video and online gaming are among the phenomena stretching the Web in exciting new directions. In this talk, Armin will present what the World Wide Web Consortium (W3C), which oversees and steers the development of new Web standards, is up to for the third decade of the Web. The W3C community is building an Open Web Platform that will enable the Web to grow and foster future innovation. This presentation covers technology highlights of 2011 for advancing the Web platform. Focus topics of this talk are the new HTML5 standard, the Data for Web Applications initiative which includes the next generation of RDF, and standards that allow people to create Semantic Web enabled Web Apps that have access to data from a variety of sources, including data-in-documents (RDFa) and data-from-databases (W3C's RDB2RDF).
Not intended as a talk but as a tale, this is a visual journey to the center of government and how government infrastructures keep data away from their real owners: the people, and how to overcome the current situation and empower them.
Read also the story that goes along (in English and Español) at:
http://datos.fundacionctic.org/2009/11/releasing-the-peoples-data/
Mobile Web Application Best Practices
is a W3C guidelines document, a Candidate Recommendation on its way to becoming an official Recommendation. In this workshop we will go in detail through the document's 35 sections, which describe techniques for improving the user experience of mobile web applications and warn against practices considered harmful.
Why are electronic services so little used? How to address the ... – Open Data @ CTIC
In the current national and international e-government landscape, the push for electronic services has rested on one premise: increase the number of online services with the highest possible degree of sophistication. But this premise has not delivered the expected returns. Although several administrations already offer more than a thousand online services at the maximum level of sophistication, their usage is low and, in some cases, worryingly low. What are the causes? How can they be addressed? How can investment in electronic services be optimized to increase returns and the number of users? In this talk, José Manuel Alonso briefly reviews the current landscape and its contradictory assumptions, and offers some recipes for optimizing investment, reducing costs and maximizing the impact of public services.
Accessible Design with HTML5 - HTML5DevConf.com May 21st San Francisco, 2012 ... – Raj Lal
Learn how to design an HTML5 application that supports people with disabilities, and why it's a good business decision. An accessible web application gives maximum reach to your application's information, functionality and benefits by allowing multiple input methods, different interaction models, and customization based on special needs and limited device support. The four major disabilities that affect user capabilities are visual, hearing, mobility (difficulty using the mouse), and cognitive disabilities, which relate to learning abilities. Learn how to use the latest technologies to accommodate these users in the user interface.
Public bodies and administrations publish more and more data on the Web every day. Sharing this data helps achieve greater transparency, enables better public service, and encourages greater public and commercial use and reuse of the data. Some administrations have even created catalogues or portals to make the data easier to find and use. Although the reasons vary from case to case, the problems and logistics of publishing such data are the same.
To help administrations open and share their data, the W3C eGovernment Interest Group has developed a set of guidelines. These simple steps emphasize standards and methods that encourage the publication of public sector data, enabling its reuse in new and innovative ways.
This talk reviews those guidelines and the ways of publishing public sector data, with emphasis on the best existing method, Linked Government Data, and shows real cases and applications of these methods in projects developed by Fundación CTIC for various administrations, as well as from other parts of the world.
The document introduces the concepts of semantics and the Semantic Web, and shows how it can teach computers to understand the meaning behind Web resources through ontologies, controlled vocabularies, and structured descriptions of resources and their relationships using RDF, OWL and SPARQL.
Presentation delivered in Dutch by Ludo Hendrickx and Joris Beek on 11 December 2013 at the Ministry of Interior, The Hague, The Netherlands. More information at: https://joinup.ec.europa.eu/community/ods/description
These slides were originally a tutorial presented for the SIG preceding the May 2009 meeting of the PRISM Forum.
They attempt to give a survey of the technologies, tools, and state of the world with respect to the Semantic Web as of the first half of 2009.
This document discusses research objects (ROs) and their role in reproducible science. It makes three key points:
1. Publications should convince readers of validity through reproducible results, but current systems do not fully facilitate reproducibility. ROs can address this by explicitly representing methods used.
2. Reproducibility reinforces results and is a key factor in scientific discovery. ROs provide a reproducible representation of methods.
3. ROs bundle together essential resources from a computational study, such as data, results, methods, people involved, and annotations for understanding, interpretation, and reuse. They support the full experimental lifecycle from problem definition to publication.
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
Presentation slides on Open Science and research reproducibility. Presented by Gareth Knight (LSHTM Research Data Manager) on 18th September 2018, as part of an Open Science event for LSHTM Week 2018.
Presentation by Ruth Wilson on Nature Publishing Group's Scientific Data journal given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... – Sarah Anna Stewart
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
This document discusses research objects and scientific workflows. It introduces research objects as a way to aggregate all elements needed to understand a research investigation, including datasets, results, experiments, and provenance. Scientific workflows are presented as tools for automating data-intensive scientific activities, with prospective and retrospective provenance capturing the intended and actual methods. The document outlines an approach to summarizing complex workflows using semantic annotations of workflow motifs and reduction primitives like collapse and eliminate. This distills provenance traces for improved understanding and querying.
Scientific Data is a new category of publication that provides detailed descriptions of scientifically valuable datasets to improve data reproducibility and reuse, with descriptors covering topics like methods, data records, and technical validation. These descriptors undergo a peer review process to assess completeness, consistency, integrity, and experimental rigor. The publication is hosted on Nature.com and aims to improve data discoverability, curation, and peer review through machine-readable metadata and clear links between data, descriptors, and related research papers.
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria,
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of "Research Objects" as coming in different types and as packaging all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (in the way we release software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded resources, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory, but also in practice with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, In Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
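The packaging idea in the abstract above (containers holding research components, with a Linked Data manifest describing context and relationships) can be illustrated with a toy manifest. This sketch is loosely inspired by Research Object manifests but uses simplified, illustrative field names; it is not the actual RO-Bundle schema.

```python
import json

# Illustrative Research Object-style manifest: one package aggregating
# data, a workflow, and a paper, with an annotation tying context to a
# component. All field names and paths here are hypothetical.

manifest = {
    "id": "urn:example:ro:1",
    "createdBy": {"name": "A. Researcher"},
    "aggregates": [
        {"uri": "data/results.csv", "mediatype": "text/csv"},
        {"uri": "workflow/analysis.cwl", "mediatype": "text/x-cwl"},
        {"uri": "paper/preprint.pdf", "mediatype": "application/pdf"},
    ],
    "annotations": [
        {"about": "data/results.csv",
         "content": "Raw output of the simulation run."},
    ],
}

# The manifest travels with the container as machine-readable context
print(json.dumps(manifest, indent=2))
```

The point is that the manifest, not the container format, carries the provenance and relationships that make the bundle interpretable later.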
The document discusses 10 habits for effective research data: 1) Preserve, 2) Archive, 3) Access, 4) Comprehend, 5) Discover, 6) Reproduce, 7) Trust, 8) Cite, 9) Use, and 10) Putting it all together. It provides examples for each habit, including data rescue challenges, the Olive Project to preserve executable content, metadata tools to improve data comprehension and sharing, initiatives for data indexing and identifiers to improve discovery and reproducibility, and proposals for making data more usable and integrated. The overall message is that adopting standards and collaborating across organizations can help research data achieve its full potential.
1. The document discusses the EOSC Dataset Minimum Information (EDMI) approach for exposing research data in the European Open Science Cloud (EOSC).
2. EDMI defines a set of 12 minimum metadata properties to facilitate finding and accessing datasets without being overly descriptive.
3. The approach was developed by engaging EOSC demonstrators and data repositories to propose methods for exposing metadata in a simple and sustainable way.
This document discusses best practices for managing research data to maximize its value and impact. It begins by outlining why proper data management is important for funding bodies, research institutions, and researchers themselves. It then describes common problems, such as a lack of standardized processes for recording experimental details. The rest of the document details several case studies and initiatives that illustrate different aspects of best practice, including preserving existing data, providing long-term access, improving comprehension, enabling discovery, ensuring reproducibility, validating data quality, and enabling proper citation of data as a research output. The overall goal is to establish data management standards and infrastructure that allow data to achieve its full potential.
re3data.org – a Registry of Research Data Repositories – Heinz Pampel
re3data.org is a global registry of research data repositories that aims to promote open sharing of research data. It indexes repositories from all academic disciplines to help researchers, funders, publishers, and institutions find appropriate places to store and share research data. The registry has grown significantly since its founding and now indexes over 1,000 repositories. It is a collaborative effort between several German and American institutions and works with other organizations to advance open data policies.
The project re3data.org–Registry of Research Data Repositories–began indexing research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape.
Linked Data Publication of Live Music Archives and Analyses – seanb
This document summarizes work done to publish live music archive data and analyses as linked open data. It describes the Internet Archive Live Music Archive collection containing over 130,000 live music performances. Computational audio analysis was performed to extract features from the recordings, and both the raw metadata and analysis results were published as linked data using ontologies. Exploratory analysis tools were created to analyze relationships in the data and validate metadata. The goal was to integrate audio analyses with bibliographic metadata to better support search and discovery of music performances.
Slides from a keynote talk at the University of Manchester UK Schools Computer Animation Competition in July 2014.
http://animation14.cs.manchester.ac.uk/festival/
Linked Data Publication of Live Music Archives – seanb
The document summarizes work to publish metadata about a live music archive collection as linked data. Key points:
- The metadata from the Internet Archive's Live Music Archive of community-contributed live recordings is published as linked data using semantic technologies like RDF.
- The data is aligned with external resources like MusicBrainz, Geonames, and DBpedia to provide additional context.
- A SPARQL endpoint allows querying the structured data to extract interesting subcollections, such as performances by artists in their home towns.
The document discusses ontologies, vocabularies, and semantic web technologies. It provides an overview of RDF, RDF Schema, and OWL, including their semantics and capabilities. It describes how ontologies can constrain models and enable reasoning to derive inferences from class definitions and axioms. The document also addresses some common misconceptions regarding ontology modeling concepts.
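The kind of reasoning mentioned above, deriving inferences from class definitions, can be sketched with one RDFS entailment rule: instance types propagate along subClassOf. The class names below are invented for illustration, and a real reasoner covers far more than this single rule.

```python
# Minimal sketch of RDFS-style reasoning: take the transitive closure of
# rdfs:subClassOf, then propagate rdf:type along it (rule rdfs9).
# The ontology here is a toy example.

subclass_of = {
    ("Dog", "Mammal"),
    ("Mammal", "Animal"),
    ("Cat", "Mammal"),
}
instance_of = {("rex", "Dog")}

def infer_types(instances, subclasses):
    # Naive fixpoint for the transitive closure of subClassOf
    closure = set(subclasses)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    # rdfs9: if x rdf:type C and C rdfs:subClassOf D, then x rdf:type D
    inferred = set(instances)
    for x, cls in instances:
        for sub, sup in closure:
            if cls == sub:
                inferred.add((x, sup))
    return inferred

types = infer_types(instance_of, subclass_of)
print(sorted(types))
```

Even this tiny example shows the point made in the document: axioms stated once in the ontology yield facts that were never asserted explicitly.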
The document introduces Sean Bechhofer and provides his contact information, including that he is from the University of Manchester, his email address, Twitter handle, and blog. It then lists several publications and projects related to reproducible and open research, including myExperiment and Research Objects, with the goal of facilitating exchange and reuse of digital knowledge. Key challenges discussed are how to move beyond linear paper publications to frameworks that better support reuse of digital assets like workflows and datasets.
This document discusses research objects as a framework for facilitating the exchange and reuse of digital knowledge. Research objects are defined as semantically rich aggregations of resources that support a research objective. They allow for workflows, data, documents and other resources to be bundled together and shared. The document outlines several motivating projects, challenges in developing research object models and vocabularies, and a vision for how research objects could allow research to be more efficient, effective and ethical through increased reuse of digital knowledge.
The document summarizes work investigating linked data approaches to support data sharing in the freshwater biology community. It describes initial experiments mapping existing datasets to support research questions on aquatic plant diversity. This included extracting queries and generating tables of correlated data for analysis. Challenges included inconsistent data formats and naming. The document questions the additional value of publishing linked data over simply providing open data and triplified data, and how to support small providers in annotating datasets and non-experts in writing SPARQL queries.
The document discusses SKOS (Simple Knowledge Organization System), a common data model for sharing and linking knowledge organization systems on the web. SKOS allows publishing thesauri and other controlled vocabularies as linked data. It provides a simple framework for representing concepts and semantic relationships to support tasks like searching across mapped thesauri. SKOS has been adopted by several communities and projects for integrating and mapping their vocabularies and terminology systems.
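One task SKOS supports, searching across a concept hierarchy, can be sketched with a toy broader/narrower structure: a query for a broad concept is expanded to all transitively narrower concepts. The concept labels below are invented, not drawn from any real thesaurus.

```python
# Sketch of SKOS-style query expansion: walk skos:narrower links so a
# search for a broad concept also matches items tagged with narrower
# ones. The hierarchy here is a toy example.

narrower = {
    "vehicles": ["cars", "bicycles"],
    "cars": ["electric cars"],
}

def expand(concept, narrower_map):
    """Return the concept plus everything transitively narrower than it."""
    result = [concept]
    stack = list(narrower_map.get(concept, []))
    while stack:
        c = stack.pop()
        if c not in result:
            result.append(c)
            stack.extend(narrower_map.get(c, []))
    return result

print(expand("vehicles", narrower))
```

Mapped thesauri fit the same pattern: merging two narrower maps whose concepts have been aligned lets one expansion search across both vocabularies.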
The document discusses key issues around applying semantic web technologies to multimedia data including metadata, annotation, integration and inference. It raises questions around what standards are needed to address these areas and how existing infrastructures could be used or extended to support tasks like feature extraction, tagging, reconciling tags and ensuring consistency at scale. Bridging semantic web approaches and multimedia will depend on the specific tasks and use cases.
The debris of the ‘last major merger’ is dynamically young – Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
ESPP presentation to EU Waste Water Network, 4th June 2024 “EU policies driving nutrient removal and recycling
and the revised UWWTD (Urban Waste Water Treatment Directive)”
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste... – Sérgio Sacani
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... – University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati... – AbdullaAlAsif1
The pygmy halfbeak, Dermogenys colletei, is known for its viviparous nature and presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study examines fecundity and the Gonadosomatic Index (GSI) in the pygmy halfbeak D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that D. colletei may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study of 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study contributes to a better understanding of viviparous fish in Borneo and to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
2. Publication
• Publications are about argumentation: convince the reader of the validity of a position
– Reproducible Results System: facilitates enactment and publication of reproducible research.
• Results are reinforced by reproducibility
– Explicit representation of method.
• Verifiability as a key factor in scientific discovery.
J. Mesirov, Accessible Reproducible Research, Science 327(5964), pp. 415-416, 2010. doi:10.1126/science.1179653
Stodden et al., Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science, Computing in Science and Engineering 12(5), pp. 8-13, 2010. doi:10.1109/MCSE.2010.113
C. Goble et al., Accelerating Scientists’ Knowledge Turns, Communications in Computer and Information Science, vol. 348, pp. 3-25, 2013. doi:10.1007/978-3-642-37186-8_1
4. Scientific Workflows
» Scientific workflows are at the heart of experimental science
› Enable automation of scientific methods
› Support experimental reproducibility
› Encourage best practices
» There is then a need to preserve these workflows
› Scientific development based on method reuse and repurposing
› Conservation is key
» Workflow preservation is a multidimensional challenge
› Representation of complex objects
› Decay analysis, diagnosis, and prevention
› Social objects that can be inspected, reused, repurposed
Preservation of scientific workflows in data-intensive science
5. Preservation
Technical
– Multi-step computational process
– Repeatable and comparative
– Explicit representation of computation
Social
– Virtual witnessing
– Transparent, precise, citable documentation
– Accurate provenance logs
– Reusable protocols, know-how, best practice
Can I review / repeat your method? Can I defend my method? Can I reuse / reproduce this method?
6. Context: Semantic Web and Linked Data
• SW: explicit machine-readable representation of information
• LD: a set of best practices for publishing and connecting data on the Web
1. Use URIs to name things
2. Use dereferenceable HTTP URIs
3. Provide useful content on lookup using standards
4. Include links to other stuff
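The four Linked Data rules above can be sketched in a few lines of Python. The `describe_resource` helper and the example URIs are purely illustrative, not part of any standard API:

```python
# A minimal sketch of the four Linked Data principles, stdlib only.
from urllib.parse import urlparse

def describe_resource(uri, links):
    """Build a simple description record for a resource:
    name it with a dereferenceable HTTP URI, and include links
    to other resources so clients can follow their nose."""
    parsed = urlparse(uri)
    assert parsed.scheme in ("http", "https"), "Rule 2: use HTTP URIs"
    return {
        "@id": uri,                       # Rule 1: URIs as names
        "seeAlso": list(links),           # Rule 4: link to other things
        # Rule 3: on lookup, serve useful content in a standard format
        "format": "application/ld+json",
    }

desc = describe_resource(
    "http://example.org/ro/experiment-1",   # hypothetical RO URI
    ["http://example.org/people/alice", "https://doi.org/10.1126/science.1179653"],
)
```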
7. Research Objects
• An aggregation object that bundles together experimental resources that are essential to a computational scientific study or investigation:
– data used;
– results produced in an experimental study;
– (computational) methods employed to produce and analyse that data;
– people involved in the investigation.
• Plus annotation information that provides additional information about both the bundle itself and the resources of the bundle
– descriptions
– provenance
9. Research Objects
• Three principles underlie the approach:
• Identity
– Referring to resources (and the aggregation itself)
• Aggregation
– Describing the aggregation structure and its constituent parts
• Annotation
– Associating information with aggregated resources.
10. Identity
• Mechanisms for referring to the resources that are aggregated within a Research Object
• URIs
– Web resources
• DOIs
– Documents/papers/datasets
• ORCID iDs
– Researchers
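As an illustration of these identifier schemes, here is a small sketch that maps bare DOIs and ORCID iDs onto resolvable HTTP URIs. The regular expressions are deliberately simplified, not full validators:

```python
import re

def normalize_identifier(value):
    """Map common scholarly identifiers to resolvable HTTP URIs.
    The patterns below are rough sketches of the DOI and ORCID syntax."""
    if value.startswith(("http://", "https://")):
        return value                                        # already a URI
    if re.match(r"^10\.\d{4,9}/\S+$", value):               # bare DOI
        return "https://doi.org/" + value
    if re.match(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$", value):  # bare ORCID iD
        return "https://orcid.org/" + value
    raise ValueError("unrecognised identifier: " + value)
```

For example, a DOI cited on the earlier slide normalises to `https://doi.org/10.1126/science.1179653`.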
11. Identifier Issues
• HTTP URIs provide both access and identification
• PIDs: Persistent Identifiers (e.g. DOIs) tend to resolve to human-readable landing pages
– With embedded links to further (possibly machine-readable) resources
• ROs seen as non-information resources with descriptive (RDF) metadata
– Redirection/negotiation
– Standard patterns for Linked Data resources
• Bidirectional mappings between URIs and PIDs
• Versioning through, e.g., Memento
H. Van de Sompel et al., Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping, 9th International Digital Curation Conference
12. Aggregation
• Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources
– http://www.openarchives.org/ore/
• Uses a Resource Map to describe the aggregated resources
• Proxies allow for statements about the resources within the aggregation
– Capturing context and viewpoints
• Several concrete serialisations
– RDF/XML, Atom, RDFa
Graceful Degradation
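A rough sketch of what an ORE-style Resource Map amounts to, using plain Python tuples as triples rather than an RDF library. The proxy URI scheme here is invented for illustration:

```python
# Sketch of an OAI-ORE Resource Map: an aggregation URI, the resources
# it aggregates, and one proxy per resource so statements can be made
# about a resource *in the context of* this aggregation.
ORE = "http://www.openarchives.org/ore/terms/"

def resource_map(aggregation_uri, resources):
    """Return (subject, predicate, object) triples for the aggregation."""
    triples = []
    for i, resource in enumerate(resources):
        triples.append((aggregation_uri, ORE + "aggregates", resource))
        proxy = aggregation_uri + "/proxy/" + str(i)  # hypothetical proxy URIs
        triples.append((proxy, ORE + "proxyFor", resource))
        triples.append((proxy, ORE + "proxyIn", aggregation_uri))
    return triples

triples = resource_map(
    "http://example.org/ro/1",
    ["http://example.org/data.csv", "http://example.org/workflow.t2flow"],
)
```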
13. Annotation
• The Open Annotation specification is a community-developed data model for annotation of web resources
– http://www.openannotation.org/spec/core/
• Developed by the W3C Open Annotation Community Group
• Allows for “stand-off” annotations
– Annotation as a first-class citizen
• Developed to fit with Web Architecture
Graceful Degradation
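The stand-off idea can be illustrated with a minimal annotation record in the spirit of the Open Annotation model: the annotation is its own resource with a body and a target, rather than markup embedded in the annotated document. The `oa:`-prefixed keys echo the vocabulary; the URIs are made up:

```python
# A minimal stand-off annotation sketch: annotation as a first-class
# resource linking a body (what is said) to a target (what it is about).
def annotate(body_uri, target_uri, motivation="oa:describing"):
    return {
        "@type": "oa:Annotation",
        "oa:hasBody": body_uri,      # e.g. a provenance description
        "oa:hasTarget": target_uri,  # the aggregated resource annotated
        "oa:motivatedBy": motivation,
    }

ann = annotate("http://example.org/ro/1/provenance.ttl",
               "http://example.org/ro/1/results.csv")
```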
14. Annotation Content
• Essential to the understanding and interpretation of the scientific outcomes captured by a Research Object, as well as the reuse of the resources within it:
– Provenance information about the experiments, the study, or any other experimental resources
– Evolution information about the Research Object and its resources
– Descriptions of computational methods or processes
– Dependency information or settings about the experiment executions
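A tiny illustration of the kind of workflow-run provenance such an annotation might capture, phrased with PROV-O-like term names. The `run_provenance` helper and all URIs are invented for the sketch:

```python
# Sketch of run provenance as triples: outputs were generated by the
# run, the run used the inputs, and the run was associated with an agent.
def run_provenance(activity, inputs, outputs, agent):
    records = []
    for out in outputs:
        records.append((out, "prov:wasGeneratedBy", activity))
    for inp in inputs:
        records.append((activity, "prov:used", inp))
    records.append((activity, "prov:wasAssociatedWith", agent))
    return records

prov = run_provenance(
    "http://example.org/runs/42",
    inputs=["http://example.org/data/input.csv"],
    outputs=["http://example.org/data/results.csv"],
    agent="http://example.org/people/alice",
)
```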
15. Core & Extensions
• Core model provides support for aggregation and annotation
• Extensions provide additional vocabularies for domain-specific tasks
• Workflow Provenance
– Information capturing workflow executions
• Workflow Description
– Abstractions describing processes, inputs and outputs
• Research Object Evolution
– Information describing change and “snapshots”
19. ROs and OAIS
• ROs as Information Packages in OAIS
• myExperiment as live/access repository
• ROHUB as archival repository
…preservation and access to preserved ROs as depicted in Figure 6. Optionally, an external repository may be used to support the frequently evolving research objects. The repositories may be housed in a single or multiple physical repositories, and use the same or differing technologies (e.g. a repository may use a digital preservation solution for the Preservation Repository and a specialised digital library solution for the Access Repository). Additionally, as the Preservation Repository does not have the same interactive use requirements as the access and live repositories, it could be implemented with slower (or offline) storage alternatives.
Figure 6. Conceptual Archival System Storage Architecture.
20. SCAPE: Planning and Watch
• SCAPE project concerned with Digital Preservation
• Planning and Watch infrastructure to help monitor the state of a repository and co-ordinate appropriate actions
• Driven by policies
[Figure: Planning, Watch and Operations components monitoring the environment, users and repository, connected by plan, deploy, monitor, access, ingest/harvest and execution flows.]
http://www.scape-project.eu/
21. Wf4Ever: Monitoring and Watch
• Ideas applied to workflow preservation
• myExperiment and RODL as the repositories
• Watch targets: decay, service deprecation, data source monitoring, checklists, minimal models
[Figure: the SCAPE Planning/Watch/Operations loop applied to workflow repositories.]
22. Decay
• Survey of 92 Taverna workflows from myExperiment
• Volatile third-party resources
• Missing data
• Missing execution environments
• Poor descriptions
Belhajjame et al., Why Workflows Break — Understanding and Combating Decay in Taverna Workflows, e-Science 2012. doi:10.1109/eScience.2012.6404482
Fig. 3. Summary of workflow decay causes: (a) an overview of the decay causes; (b) workflow decay due to third-party resources.
23. Checklists and Validation
• Checklists widely used to support safety, quality and consistency
• Common in experimental science
– Expressing minimum information required
– Supporting “health” monitoring of workflow-centric ROs
• Checklists can be defined in terms of the RO model and its annotations
– Generic checklist service then executes against that model and the given annotations
– Provenance
24. Minim Data Model
An RO is “fully compliant”, “nominally compliant” or “minimally compliant” with a checklist if it satisfies all of its MAY, SHOULD or MUST items respectively.
Our Minim data model (see Figure 1) provides four core constructs to express a quality requirement.
Fig. 1. An overview of the Minim model schema: a Checklist has MUST, SHOULD and MAY Requirements; each Requirement is affirmed by a Rule (e.g. a QueryTestRule carrying a SPARQL query pattern and result modifier, or a SoftwareEnvRule); query results are checked by tests such as CardinalityTest (min/max cardinality), AggregationTest and AccessibilityTest (URI templates), and ExistsTest.
Zhao et al., A Checklist-Based Approach for Quality Assessment of Scientific Information, 3rd Int. Workshop on Linked Science, 2013
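A toy evaluator conveys the checklist idea. This is only a sketch of the MUST/SHOULD/MAY logic, not the actual Minim engine, and the compliance labels other than “minimally compliant” are assumed from the truncated slide text:

```python
# Sketch of checklist evaluation: each requirement pairs a level with a
# test over the RO's annotations; the result is the strongest compliance
# level whose requirements (and all stricter ones) are all satisfied.
def evaluate(checklist, ro_annotations):
    satisfied = {"MUST": True, "SHOULD": True, "MAY": True}
    for level, test in checklist:
        if not test(ro_annotations):
            satisfied[level] = False
    if satisfied["MUST"] and satisfied["SHOULD"] and satisfied["MAY"]:
        return "fully compliant"
    if satisfied["MUST"] and satisfied["SHOULD"]:
        return "nominally compliant"    # label assumed, see lead-in
    if satisfied["MUST"]:
        return "minimally compliant"
    return "non-compliant"

checklist = [
    ("MUST",   lambda ro: "workflow" in ro),     # workflow definition present
    ("SHOULD", lambda ro: "provenance" in ro),   # provenance log present
    ("MAY",    lambda ro: "description" in ro),  # human-readable description
]
result = evaluate(checklist, {"workflow": "...", "provenance": "..."})
```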
27. RO Bundle
• A single, transferable object encapsulating the description and resources of an RO
– Download, transfer, publish
• ZIP-based format (resources) plus a manifest describing aggregation and annotations (description)
– Unpack with standard tooling
• JSON-LD as a representation for the manifest
– Lightweight linked-data format
– Compatible with existing JSON tooling and services
– PROV-O and OAC for annotations
http://wf4ever.github.io/ro/bundle/
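A rough sketch of writing such a bundle with only the standard library. The manifest layout is simplified, and the `.ro/manifest.json` path and `@context` URI are assumptions to be checked against the spec linked above:

```python
# Sketch of an RO Bundle: a ZIP of resources plus a JSON manifest
# describing what the bundle aggregates.
import json
import zipfile

def write_bundle(path, resources):
    manifest = {
        "@context": "https://w3id.org/bundle/context",  # assumed context URI
        "aggregates": [{"uri": name} for name in resources],
    }
    with zipfile.ZipFile(path, "w") as z:
        z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name, content in resources.items():
            z.writestr(name, content)

write_bundle("example.bundle.zip", {
    "data/results.csv": "a,b\n1,2\n",
    "workflow/analysis.t2flow": "<workflow/>",
})
```

Because the container is plain ZIP, the bundle can still be unpacked (and the resources inspected) with standard tooling even by consumers that know nothing about the manifest.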
28. Bundling via git/Zenodo/figshare
• Scientist works with a local folder structure
– Version management via GitHub
– Local tooling produces metadata descriptions
– Metadata about the aggregation (and its resources) provided by a “hidden folder”
• Zenodo/figshare pull snapshots from GitHub
– Providing DOIs for the aggregations
– Additional release cycles can prompt new DOIs
37. Wrap Up
• Aggregation objects bundling together experimental resources that are essential to a computational scientific study or investigation
– Intended to support greater transparency and reproducibility
• Annotations provide additional information about the bundle and its contents
– Metadata is key here
• Use of existing standards, vocabularies and infrastructure
• Nascent tooling to support creation, management and publication
38. Thanks!
• All the members of the Wf4Ever team
– iSOCO: Intelligent Software Components S.A., Spain
– University of Manchester, School of Computer Science, Manchester, United Kingdom
– University of Oxford, Department of Zoology, Oxford, UK
– Poznan Supercomputing and Networking Center, Poznan, Poland
– IAA: Instituto de Astrofísica de Andalucía, Granada, Spain
– Leiden University Medical Centre, Centre for Human and Clinical Genetics, The Netherlands
• Colleagues in Manchester’s Information Management Group
• RO Advisory Board Members
http://www.researchobject.org
http://www.wf4ever-project.org
Editor's Notes
Metadata to support reproducibility.
What does that mean?
What do we need to do?
How do we do it?
Will run through the approach that was taken, and some of the vocabs and standards that are being used to do it.
What’s the purpose of publication?
Publications are intended to present results and positions, along with arguments that reinforce those positions.
Reproducibility reinforces the validity of our positions.
May require us to include much more information than can be included in a paper: in particular, data sets and methods.
Understanding the different roles that are involved in supporting the scientific lifecycle and experimental process.
One of the key issues is that HTTP URIs serve multiple purposes. They are identifiers, but also serve as a mechanism for locating or accessing the content. PIDs, on the other hand, tend to involve a resolution or redirection process which guides us to the content. Commonly that resolution ends up on a landing page though – for example, DOIs usually resolve to a web page, which may then provide embedded links to further resources.
We can consider ROs as non-information resources (things whose distinguishing characteristics can’t be conveyed in a message). On resolving the ID for such a thing we get descriptive metadata about it (but not the thing itself). This is a common pattern used for Linked Data resources.
Herbert proposes a bidirectional mapping between PIDs and the HTTP URIs that provide access to the information about them. So we can go from PID to stuff, and from stuff to the PID that it is about.
Approaches like Memento could then be applied to support versioning.
I don’t think there are necessarily any deep problems lurking here – it’s more about the way in which services are set up and establishing convention and practice.
Lose this?
Local folder/file structures – experiences with our astronomy users.
Use github for version management. Local tooling produces metadata descriptions.
Example RO in zenodo
Example RO in figshare.
Cf Code as a research object.
Work by Dani Garijo of UPM. Web page generated from metadata about papers. RO includes information about the materials provided.
Systems biology bundling. Experiments in mapping between COMBINE archives and ROs.
http://co.mbine.org/documents/archive