Research Objects for FAIRer Science

Professor Carole Goble CBE FREng FBCS
The University of Manchester, UK
carole.goble@manchester.ac.uk
VIVO/SciTS Conferences 6-8 August 2014, Austin, TX

Scientific publications have at least
two goals:
(i) to announce a result and
(ii) to convince readers that the
result is correct
…..
papers in experimental science
should describe the results and
provide a clear enough protocol to
allow successful repetition and
extension
Jill Mesirov
Accessible Reproducible Research
Science 22 Jan 2010: 327(5964): 415-416
DOI: 10.1126/science.1179653
Virtual Witnessing*
*Leviathan and the Air-Pump: Hobbes, Boyle, and the
Experimental Life (1985) Shapin and Schaffer.

Capturing, representing,
sharing the information
needed to understand how a
research result came about.
Context of results
• Inputs, outputs, process…
Context of resources
• Instruments, data, software,
people…

“An article about computational
science in a scientific publication
is not the scholarship itself, it is
merely advertising of the
scholarship. The actual
scholarship is the complete
software development
environment, [the complete
data] and the complete set of
instructions which generated the
figures.”
David Donoho, “Wavelab and Reproducible
Research,” 1995
datasets
data collections
standard operating
procedures
software
algorithms
configurations
tools and apps
codes
workflows
scripts
code libraries
services,
system software
infrastructure,
compilers
hardware
Morin et al., Shining Light into Black Boxes,
Science 13 April 2012: 336(6078) 159-160
Ince et al., The case for open computer programs,
Nature 482, 2012

“I can’t immediately reproduce the research in
my own laboratory. It took an estimated 280
hours for an average user to approximately
reproduce the paper.”
Phil Bourne
NIH BigWig for Data Science

a reproducibility paradox
big, fast,
complicated,
multi-step,
multi-type
multi-field
greater
expectations
of
reproducibility
diy publishing
greater access

Systems Biology Collaborations
Modelling
Cycle
45 organisations 112 organisations

Data
Models Articles
External
Databases
http://www.seek4science.org
Metadata
http://www.isatools.org
Ontology-driven Aggregated Content Infrastructure
(Framework) for building Sys Bio Commons
sharing and interlinking multi-stewarded, mixed methods, models, data, samples…
Standards
DCAT
FOAF
Yellow
Pages

Yellow Pages
Careful
Sharing
Options

Investigations
Assays
Studies
Towards Interoperable Bioscience Data, Nature Genetics, 2012
Standards, Structure, Interlink
Just Enough Results Model
for things produced and used
in experiments

Construction
data
Validation data
Metabolomics
Mass Spec
Transcriptomics
Proteomics
Fluxomics
Publications
Mix of
locally &
remotely
hosted
content
Open Modelling Exchange Format Archive
Wolstencroft et al, Proc ISWC 2013
Just Enough Results Model for
stuff in experiments
Common elements
Data type specific elements

Experimentalists,
modellers & developers
Cross-site, cross project
collaboration
Knowledge network
Building the System: Building a Cult
TRUST
VISION
SETTING
EXPECTATIONS
Drink together
Work together

• Collaboration –
Complementarity correlation
• Modellers share more than
Experimentalists
• Experimentalists reuse models
more than Modellers
• Active enclave sharing
• Public sharing tricky even after
publication, bribery and threats
• Data Hugging, Flirting and
Voyeurism

• Playground rules apply
• Fluid, transient collaborations >
membership mgt pain in a*se
• Shameless exploitation of PI
competitiveness & vanity
• PI & Funder leadership
• Pan project spawned
collaborations –YES!!!!
• But not necessarily visible to us.

Data discovery
Data assembly,
cleaning, and
refinement
Ecological Niche
Modeling
Statistical analysis
Data collection
Insights Scholarly Communication
& Reporting
Enclosed sea problem
(Ready et al., 2010)
Pilumnus hirtellus
Scientific
Workflows

BioSTIF
method
instruments and laboratory
materials
Method Matters!

"Mapping present and future predicted distribution patterns for a meso-grazer
guild in the Baltic Sea" by Sonja Leidenberger et al

1st International Workshop on Social Object Networks (SocialObjects 2011), Boston, October 9th 2011.
Find, Click ‘n’ Go
File ‘n’ Forget
Specialist Curators

Properties: What would you ask a publication if you could?
• Identity and Description, Uniqueness, Authenticity: Who are you? Where and when were you born? Who were your parents (creators)?
• Review, Reuse, and Repurpose: For which purpose were you conceived and have been used?
• Inspection, Visualization, Annotations: What do you have inside?
• Representation: How is your content structured?
• Access Rights: May I access all your parts?
• Adaptability: Which parts can I replace?
• Evolution & Versioning, Provenance: What have they done to you? Who and when? Why did they do that?
• Quality: Why are you relevant to me? Can I believe what you are saying or trust your results?
• Reproducibility: Do you still produce the same results?
• Fitness: Are you still working? How could I repair you?
• Credit and attribution: How could I thank you? How could I talk about you?

From
Manuscripts
to
“Research Objects”
A meme
The multi-
dimensional paper
Packs

Howard Ratner, STM Innovations Seminar 2012
was: Chair STM Future Labs Committee, CEO EVP Nature Publishing Group,
now: Director of Development for CHORUS (Clearinghouse for the Open Research of US)
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
http://www.myexperiment.org/packs/196.html

What The Commons* Is and Is Not
• Is Not:
– A database
– Confined to one physical location
– A new large infrastructure
– Owned by any one group
• Is:
– A conceptual framework
– Analogous to the Internet
– A collaboratory
– A few shared rules
  • All research objects have unique identifiers
  • All research objects have limited provenance
Philip E. Bourne Ph.D.
Associate Director for Data Science, National Institutes of Health
http://www.slideshare.net/pebourne
*The NIH BD2K Commons Framework, $100 million in 2015

Social
Objects
carriers of discourse

http://www.researchobject.org/
A Framework to Bundle and Relate multi-hosted
(digital) resources of a scientific experiment or
investigation using standard mechanisms & uniform
access protocols. Carriers of Research Context
Outputs are first class
citizens to be managed,
credited and tracked:
data, software
Research Objects

Links
• Recording & linking
together the
components of an
experiment
• Linking across
experiments.

Preserve
Archive
Reproduce*
Recompute
Reuse
Train & Explain
Exchange
Remix
Fix
* a word that means many things…..

re-compute
replicate
rerun repeat
re-examine
repurpose
recreate
reuse
restore
reconstruct
review
regenerate
revise
recycle
regenerate
the figure
redo
Results may vary

repeat replicate
Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
Methods
(techniques, algorithms,
spec. of the steps)
Materials
(datasets, parameters,
algorithm seeds)
Experiment
Instruments
(codes, services, scripts,
underlying libraries)
Laboratory
(sw and hw infrastructure,
systems software,
integrative platforms)
Setup
reuse reproduce
Executable Research Object

same experiment
same set up
same lab
same experiment
same set up
different lab
same experiment
different set up
different experiment
some of same
Validate
reuse reproduce
repeat replicate
http://www.biomedcentral.com/biome/carole-goble-on-reproducible-research-what-it-really-means-how-to-reach-it/
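
The four cases on this slide can be read as a small decision rule. The sketch below assumes that the four conditions map, in the order listed, to repeat, replicate, reproduce and reuse; the slide shows the conditions and the terms but not the pairing, so that mapping is an assumption, and the function name `classify_rerun` is made up for illustration.

```python
def classify_rerun(same_experiment: bool, same_setup: bool, same_lab: bool) -> str:
    """Name the kind of re-run, under the assumed pairing of cases to terms."""
    if not same_experiment:
        # Different experiment that borrows some of the same parts.
        return "reuse"
    if not same_setup:
        # Same experiment, different set-up; the lab is not considered here.
        return "reproduce"
    # Same experiment and same set-up: only the lab separates the two cases.
    return "repeat" if same_lab else "replicate"


if __name__ == "__main__":
    for case in [(True, True, True), (True, True, False),
                 (True, False, False), (False, False, False)]:
        print(case, classify_rerun(*case))
```

Writing it this way makes the ordering explicit: the experiment is checked first, then the set-up, and the lab only matters once the other two are the same.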

Design
Execution
Result Analysis
Collection
Publish /
Report
Peer
Review
Peer
Reuse
Modelling
Can I repeat &
defend my
method?
Can I review / reproduce
and compare my results /
method with your results /
method?
Can I review /
replicate and certify
your method?
Can I transfer your
results into my
research and reuse
this method?
* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)
Research Report
Prediction
Monitoring
Cleaning

specialist codes
libraries, platforms, tools
services
(cloud)
hosted
services
commodity
platforms
data collections
catalogues software
repositories
my data
my process
my codes
integrative
frameworks
gateways

data
carpentry
http://software-carpentry.org/

Components &
Dependencies
• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies
observed in 3 bands
• Multiple platforms
• Multiple systems
José Enrique Ruiz (IAA-CSIC)
Galaxy
Luminosity
Profiling

Executable Instrument
Entropy
Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble.
Why workflows break - Understanding and combating decay in Taverna workflows,
8th Intl Conf e-Science 2012
Mitigate
Detect, Repair
Preserve
Partial replication
Approx. reproduction
Verification
Benchmarks

Executable Instrument Entropy
Prepare to Repair
Reproducibility by Inspection
Read It
Reproducibility by Invocation
Run It
Document Instrument

[Adapted Freire, 2013]
provenance
gather dependencies
capture steps
track & keep results
portability
variability tolerance
preservation
packaging
versioning
open
accessible
available
machine actionable
description
intelligible
machine-readable
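
As a rough illustration of "capture steps" and "track & keep results", the sketch below wraps one processing step and records when it ran, on what platform, and content hashes of its input and output. It is a minimal sketch, not any particular provenance tool: the function names and the record fields are assumptions chosen for this example.

```python
import hashlib
import platform
import sys
from datetime import datetime, timezone
from typing import Callable, Dict


def sha256_hex(data: bytes) -> str:
    """Content hash used to tell later whether an input or result changed."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(step: Callable[[bytes], bytes], name: str,
                        data: bytes) -> Dict[str, object]:
    """Run one step and return its result together with a provenance record."""
    started = datetime.now(timezone.utc).isoformat()
    result = step(data)
    ended = datetime.now(timezone.utc).isoformat()
    return {
        "step": name,                          # capture steps
        "started": started,
        "ended": ended,
        "input_sha256": sha256_hex(data),      # track inputs
        "output_sha256": sha256_hex(result),   # track & keep results
        "python": sys.version.split()[0],      # a first, coarse dependency
        "platform": platform.platform(),
        "result": result,
    }


if __name__ == "__main__":
    record = run_with_provenance(bytes.upper, "uppercase", b"acgt")
    print(record["step"], record["output_sha256"][:12])
```

A real set-up would also need to gather library versions and keep the records somewhere durable; this only shows the shape of a single record.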

Authoring
Exec. Papers
Link docs to experiment
Sweave
Provenance
Tracking,
Versioning
Replay, Record, Repair
Workflows,
makefiles
ProvStore
provenance
gather dependencies
capture steps
open
accessible
available
machine actionable
description
intelligible
machine-readable

packaging
portability
preservation
provenance
gather dependencies
capture steps
versioning
host
service
Open Source/Store
Sci as a Service
Integrative fws
Virtual Machines
Recompute, limited
installation, Black Box
Byte execution, copies
Descriptive read,
White Box
Archived record
Read & Run, Co-location
No installation
Portable Package
White Box, Installation
Archived record

host
service
ReproZip
packaging
portability
preservation
provenance
gather dependencies
capture steps
versioning

No Green Fields
No One System
Find Access Interop Reuse
Porting across Platforms
Exchange between Systems
Comparing across Labs

Identity
Description
Packaging
Refer to
aggregations
and their
resource
contents
Interpretation:
What does it
mean?
How can I
compare with
others?
How is it linked
together and
linked to others?
Describe
aggregation
structure and its
constituent
parts
Container
regardless of
host
FAIR RO Core Model
manifest
Uniform and first
class handling of
diverse types
(data, software,
workflows…)

Identity
Annotation
Aggregation
FAIR RO Core Model
DOIs
URIs
Handles
ORCID
W3C OAM
OAI-ORE
Open Annotation Model
OAI-Object Reuse and Exchange

Identity
Annotation
Aggregation
FAIR RO Core Model
DOIs
URIs
Handles
ORCID
Aggregations
Resource maps
Proxies
Annotation first
class and stand-off
Identity persistence
and resolution
Citation
W3C OAM
OAI-ORE
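
The three parts of the core model (identity, aggregation, stand-off annotation) can be pictured as a small data structure. This is a hypothetical in-memory sketch only: the class and method names are invented for illustration and are not taken from the RO model vocabulary, OAI-ORE or the Open Annotation Model, and the example.org URIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: kept beside the resource, not inside it."""
    target: str  # URI of the annotated resource, or of the aggregation itself
    body: str    # URI or literal text carrying the description


@dataclass
class ResearchObject:
    """An identified aggregation of resources plus annotations about them."""
    identifier: str                                   # e.g. a DOI, URI or Handle
    aggregates: List[str] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)

    def aggregate(self, resource_uri: str) -> None:
        # Resources are referenced by URI, so they may be hosted anywhere.
        if resource_uri not in self.aggregates:
            self.aggregates.append(resource_uri)

    def annotate(self, target: str, body: str) -> Annotation:
        # Only the RO itself or something it aggregates may be annotated.
        if target != self.identifier and target not in self.aggregates:
            raise ValueError(f"{target} is not part of {self.identifier}")
        note = Annotation(target, body)
        self.annotations.append(note)
        return note

    def annotations_for(self, target: str) -> List[Annotation]:
        return [a for a in self.annotations if a.target == target]


if __name__ == "__main__":
    ro = ResearchObject("http://example.org/ro/1")
    ro.aggregate("http://example.org/data/run1.csv")
    ro.annotate("http://example.org/data/run1.csv", "raw measurements")
    print(ro.annotations_for("http://example.org/data/run1.csv"))
```

Because annotations only point at their targets, a description can be added to a remotely hosted or read-only resource without touching the resource itself.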

Identity
Annotation
Aggregation
FAIR RO Core Platforms
DOIs
URIs
Handles
ORCID
Data Citation
Implementation
W3C OAM
OAI-ORE

Distributed
Third Party
Tenancy
Alien
Store
Aggregation
Carrier of Research Context
• Identifiable, citable, resolvable
• Uniform Management
• Mixed Stewardship
• Decay & Graceful Degrade
• Content & Aggregation
Lifecycles
• Annotations
• Manifests, Recipes,
Permissions, Discourse
Aggregations
• Dispersed / Encapsulated
• External (linked) / Local
• Mixed types
• Blackboxes
• Virtual / Materialised
Content Resources
• Aggregations themselves
• In many aggregations
• Virtual / Materialised
• Open / Closed

TARDIS: Time and Relative Dimension in Space

Progression Levels
Management and Interpretation for Integrated Applications
• RO Management
– Transportation / Access / Citation
– Id location of RO “container”
– Provenance of RO & contents
– Behaviour/lifecycle of RO & contents
– Policies
• RO Interpretation
– What the RO and its content mean
– How they can be compared and validated
– How they can be used, executed, linked
• Interpretation variations
– Type (e.g. Workflows)
– Discipline (e.g. Biology)
– Task (e.g. Discovery, Execution)
– Activity (e.g. Experiment)

Checklists
Versioning
Provenance
Dependencies
More
Stakeholders
& Services
Citation
minimum
More specialised
detail
Fewer but more
specialised
stakeholders &
services
Annotation
Profiles
Depth: how deeply
described
Coverage: how
much is covered.
Progression levels
Semantic Framework

Checklists
Versioning
Provenance
Dependencies
NISO-JATS
EXPO, ISA
JERM, OBI
MIAME, SBML
GIT
MIM Ontology
PROV
PAV
VoID
Puppet Docker
Make
PAV
RO Model: roevo, wfprov, wfdesc
SysBio Workflows
DCAT
Annotation
Profiles
Depth: how deeply
described
Coverage: how
much is covered.
Progression levels
Semantic Framework
Experiment
VIVO-ISF
DC

Checklists
aka Minimum Information Models
• Safety, quality, consistency
• Validation, monitoring
• Common in experimental science
• Checklists defined in terms of the RO model and its annotations
• Services execute against model and an RO’s annotations
Zhao et al., A Checklist-Based Approach for Quality Assessment of Scientific Information, 3rd Intl. Workshop on Linked Science, 2013
Minim Checklist Ontology to
describe checklists
Must, Should…
Cardinalities…
Rules…
http://purl.org/net/mim/ns
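
To make the idea of a service executing a checklist against an RO's annotations concrete, here is a minimal sketch. It does not use the Minim ontology itself: the `Requirement` class, the annotation keys and the report layout are all assumptions for illustration, and only the Must/Should levels and the cardinality idea come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Requirement:
    level: str          # "MUST" or "SHOULD"
    key: str            # annotation key that has to be present
    min_count: int = 1  # cardinality: how many values are needed


def evaluate(checklist: List[Requirement],
             annotations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the annotation keys that fail each requirement level."""
    report: Dict[str, List[str]] = {"failed_must": [], "failed_should": []}
    for req in checklist:
        found = len(annotations.get(req.key, []))
        if found >= req.min_count:
            continue
        bucket = "failed_must" if req.level == "MUST" else "failed_should"
        report[bucket].append(req.key)
    return report


def is_satisfied(report: Dict[str, List[str]]) -> bool:
    """Only unmet MUST items fail the checklist; SHOULD items are warnings."""
    return not report["failed_must"]


if __name__ == "__main__":
    checklist = [
        Requirement("MUST", "title"),
        Requirement("MUST", "input_data", min_count=2),
        Requirement("SHOULD", "licence"),
    ]
    annotations = {"title": ["Galaxy luminosity profiling"],
                   "input_data": ["band_g.fits"]}
    report = evaluate(checklist, annotations)
    print(report, is_satisfied(report))
```

Keeping the checklist as data, separate from the evaluator, is what lets the same service run different minimum information models against the same RO.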

Towards Smart Integrated Applications & Mediation
1. Id & Cite fluid things
2. First class citizenship &
uniform handling of artifacts
3. Compound
4. Mixed, leaky Containers
5. Span outcomes, evolve
outputs, emergence
6. Layered interpretation and
management profiles using
standards
7. Machine-processable
8. Technology Independent
Bechhofer et al., Why linked data is not enough for scientists,
DOI: 10.1016/j.future.2011.08.004

Research Objects Framework
a systematic approach to representing
a different unit of scholarship
“development” view
“logical” view
“process” view
“physical” view
SERVICES
POLICIES
LIFECYCLES
METADATA
PROFILES

Open Archival Information System Pilot
ROs are “Information Packages”
ROManager
RODL

• A single, transferable object
encapsulates description and
resources
– Download, transfer, publish
• ZIP-based format + manifest
describes aggregation and
annotations
– Unpack with standard tooling
• JSON-LD for manifest
– Lightweight linked-data format
– Use JSON tooling and services
Baking with off the
shelf platforms
OMEX archive
bundle
Adobe
UCF
ORE PROV ODF
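
The bundle idea above (a ZIP file holding the resources plus a JSON-LD manifest that describes the aggregation and its annotations) can be sketched with standard tooling. The manifest location `.ro/manifest.json`, the placeholder `@context` and the field names below are assumptions for illustration and should not be read as the normative bundle layout.

```python
import json
import zipfile
from pathlib import Path
from typing import Dict, List

MANIFEST_PATH = ".ro/manifest.json"  # assumed location inside the ZIP


def write_bundle(bundle_path: Path, resources: Dict[str, bytes],
                 annotations: List[dict]) -> None:
    """Pack resources and a manifest describing them into one ZIP file."""
    manifest = {
        # Placeholder context; a real bundle would point at a published one.
        "@context": {"@vocab": "http://example.org/ro-sketch#"},
        "aggregates": [{"uri": name} for name in sorted(resources)],
        "annotations": annotations,
    }
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for name, payload in resources.items():
            zf.writestr(name, payload)


def read_manifest(bundle_path: Path) -> dict:
    """Unpack only the manifest, using nothing but standard ZIP and JSON tooling."""
    with zipfile.ZipFile(bundle_path) as zf:
        return json.loads(zf.read(MANIFEST_PATH))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "experiment.zip"
        write_bundle(path,
                     {"data/run1.csv": b"x,y\n1,2\n"},
                     [{"about": "data/run1.csv", "content": "input table"}])
        print(read_manifest(path)["aggregates"])
```

Because the result is an ordinary ZIP archive, a recipient without any RO tooling can still unpack it and read the resources.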

• Work with local folder
structure.
– Version: GitHub.
– Metadata: Local tooling
– Metadata about aggregation
and its resources: “hidden
folder”
• Zenodo/figshare pull
snapshot from GitHub
– DOIs for aggregation
– new DOIs: release cycles
Baking with off the
shelf platforms
http://dx.doi.org/10.6084/m9.figshare.1031591

FARSITE
coded descriptions of
clinical study cohorts
an NHS tool to assess the
feasibility of gathering a cohort
packages codes,
study, and metadata
Home
Baking

integrated database and journal
http://www.gigasciencejournal.com
galaxy.cbiit.cuhk.edu.hk
[Peter Li]

Nanopub: represents structured
data along with its provenance in a
single publishable and citable entry
Galaxy workflows: re-enact the analysis
Research Object:
aggregates the
(digital) resources
contributing to
findings of
(computational)
research (results,
data and software)
as citable
compound digital
objects
http://isa-tools.github.io/soapdenovo2/
http://sandbox.wf4ever-project.org/portal/ro?ro=http://sandbox.wf4ever-project.org/rodl/ROs/SOAP2denovo2-Aureus/
[Alejandra Gonzalez-Beltran
Philippe Rocca-Serra]

what’s the least we can do?
how might ROs be minted and used by science teams?
how might ROs be implemented and used by developer teams?
Standards
Models
Platforms
Id Schemes
Resolution
Light touch
Extensible
Infiltration
Mapping
Making,
Curating, Using
Nudging
Sharing
Linking
Infiltration
Embedding into
and changing
work practices
TOOLS
Citing
Technical Social
Reward
Mixed stewardship
Citation
Schemes
Fragility

(meta)Data Capture Platforms
Process Capture Platforms

Stealthy not Sneaky
to reduce the friction
instrument the world
Incremental
JIJIT not JIC
Focus on Personal
Productivity
not Public Good
Auto-magical
From made reproducible to born reproducible
What’s the least we can do?

Knowledge Turns
Transportation & Mediation
Unit of Scholarly Currency
Context, Comparison
Distributed: Search, Discover, Index, Harvest, Port
Research Turns
Release model: Evolution, Emergence,
Discourse, Comparison, Historical review
Forks, Merges & Fixivity
Flow across groups, projects and articles
Anti-Salami, Threaded Publications
Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013
Profile Focus
Body of knowledge around methods, workflows,
software, data, person, rather than publication.
First class citation, credit and respect

Open Research Practice is (increasingly) like
Open Source Software Practice.
(Which we know a lot about)

FAIR research practice benefits from a shared and
principled approach for identification, aggregation
and annotation of research components of all kinds.
– Using existing standards, vocabularies, frameworks,
platforms, infrastructures. Using linked data and
semantic interoperability
VIVO - to represent the
full context of
researchers’ work.
SciTS – to study the
research process and
research collaboration

• Barend Mons
• Sean Bechhofer
• Philip Bourne
• Matthew Gamble
• Raul Palma
• Jun Zhao
• Alan Williams
• Stian Soiland-Reyes
• Paul Groth
• Tim Clark
• Juliana Freire
• Alejandra Gonzalez-Beltran
• Philippe Rocca-Serra
• Ian Cottam
All the members of the Wf4Ever team
iSOCO: Intelligent Software Components S.A.,
Spain
University of Manchester, School of Computer
Science, Manchester, United Kingdom
University of Oxford, Department of Zoology,
Oxford, UK
Poznan Supercomputing and Networking
Center. Poznan, Poland
IAA: Instituto de Astrofísica de Andalucía,
Granada, Spain
Leiden University Medical Centre, Centre for
Human and Clinical Genetics, The Netherlands
Colleagues in Manchester’s Information
Management Group
RO Advisory Board Members
http://www.researchobject.org
http://www.wf4ever-project.org
