Scientific Data Management - From the Lab to the Web


The digital universe is booming, especially metadata and user-generated data. This makes it hard to identify the portions of data that are relevant to a particular problem and to manage the lifecycle of that data. Finer-grained problems include data evolution and the impact that change can have on the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Object concept as a means to identify and structure relevant data in scientific domains, treating data as first-class citizens. We also identify and formally represent the main causes of decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.



  • In this scenario, student Dennis has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected. The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter the genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, in which each gene is coupled to the pathways it was implicated in by other experiments (there is a database for that: KEGG).
    Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment for keywords. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.
    Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It will lead to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to his experiment and the processed data. Dennis notes that in a next cycle he will want to perform another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part in his workflow with the global test workflow. In his log he indicates this fact. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.
    Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication has cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.
    Dennis' Working Research Object will contain:
    » A reference to the source of the data and the people to acknowledge for it
    » The initial hypothesis
    » The conceptual workflow or a summary of the experiment plan
    » References to workflows that were tested, with comments on their application to Dennis' case
    » A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
    » Dennis' workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
    » Dennis' workflow run, results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
    » The final hypothesis, with comments
    » A reference to the results of the workflow
    » A Design log that records Dennis' considerations while making the workflow
    » A Run log that records Dennis' considerations while running and interpreting the workflow
    His Publication Research Object will contain:
    » The workflow
    » A caption for his workflow (filtered from his design and run log, all information necessary for a reviewer to run the experiment)
    » A workflow run (results, and a caption filtered from the run log)
    » His initial hypothesis
    » His final hypothesis
    » The data source
    » Acknowledgements
    In time, Dennis' workflow can be found on the basis of the metadata of his Published and Working ROs. This will create a rich and wide range of search capabilities for Dennis' successors.
    The Working RO is kept at Dennis' local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
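The enrichment step of the conceptual workflow in the scenario above can be illustrated with a small sketch. The note does not say which enrichment test is used; the hypergeometric over-representation test below is one common choice and is an assumption here. The gene identifiers and pathway memberships are made up for illustration; in practice the annotation would come from a database such as KEGG.

```python
from math import comb


def enrichment_p_value(background: int, in_pathway: int,
                       selected: int, overlap: int) -> float:
    """P(X >= overlap) when `selected` genes are drawn at random from
    `background` genes, of which `in_pathway` belong to the pathway."""
    total = comb(background, selected)
    upper = min(in_pathway, selected)
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy data: 20 measured genes and two annotated pathways.
background_genes = {f"g{i}" for i in range(1, 21)}
pathways = {
    "Wnt signalling": {"g1", "g2", "g3", "g4", "g5"},
    "Other pathway": {"g6", "g7", "g8", "g9", "g10"},
}
# Hypothetical output of the statistical filtering step.
significant_genes = {"g1", "g2", "g3", "g4", "g11"}

p_values = {
    name: enrichment_p_value(
        len(background_genes),
        len(members),
        len(significant_genes),
        len(members & significant_genes),
    )
    for name, members in pathways.items()
}

# Report pathways from most to least enriched.
for name, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{name}: p = {p:.4f}")
```

In this toy data four of the five filtered genes fall in the first pathway, so it gets a small p-value, while the second pathway has no overlap and gets a p-value of 1.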
  • Emphasise the use of Linked Data. Note: the figures here are not intended to be readable. They simply emphasise the existence of the models. Example user requirements being addressed by RO:
    » UR1.3: aggregate existing resources to conveniently access related resources from a single place
    » UR1.6: describe the relationships between aggregated resources so that other researchers can see how the resources fit together
    » UR1.16: annotate experimental results using semantic models so that I can find/show links to other, relevant research objects

Presentation Transcript

  • Scientific Data Management - From the Lab to the Web. José Manuel Gómez Pérez, iSOCO. Semantic Data Management, Dagstuhl Seminar, 22-27 April 2012. www.wf4ever-project.org
  • The data deluge: some facts
    » In 2010 the size of the digital universe exceeded 1 zettabyte (= 1 trillion GB)
    » 1.8 ZB in 2011
    » 35 ZB expected in 2020
    » 90% unstructured data
    » 70% user-generated
    » 75% resulting from data copying, merging, and transforming
    » Metadata is the fastest growing data category
    » Much of such data is dynamic, real-time, volatile
    Source: IDC's The 2011 Digital Universe Study – Extracting Value from Chaos
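As a rough worked example of what these figures imply, the sketch below computes the compound annual growth rate between the 2011 figure (1.8 zettabytes) and the 2020 projection (35 zettabytes). The function name is illustrative, and the constant-rate assumption is ours rather than IDC's.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1 / years) - 1


# Figures quoted on the slide, in zettabytes.
size_2011_zb = 1.8
size_2020_zb = 35.0

rate = compound_annual_growth_rate(size_2011_zb, size_2020_zb, 2020 - 2011)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the digital universe would have to grow by about 39% every year for nine years.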
  • Dealing with dynamicity: two main challenges
    » Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand
      › First-class data citizens
    » Challenge 2: Managing the lifecycle of data entities
      › Preservation
      › Evolution and versioning
      › Decay
    Both technical and social aspects are involved.
  • The Research Lifecycle: Workflows in the Scientific Method
    [Figure: research lifecycle diagram. Labels: Background, Assumptions, Hypothesis, Scientific Experiment, Input data, Method, Results (data), Interpretation, Publication]
    Example: Genome-Wide Association Studies
  • Workflow-based Science: what is a scientific workflow?
    » A mechanism for coordinating the execution of services and linking together resources.
    » The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving.
    Scientific workflows are at the core of scientific data management
      › Enable automation
      › Encourage best practices
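To make the definition concrete, here is a minimal sketch of a workflow as a configurable, ordered set of steps that also records provenance for each step it runs. The `Step` and `Workflow` classes and the three toy steps are hypothetical and do not correspond to any particular workflow system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One named processing step of a workflow."""
    name: str
    func: Callable[[Any], Any]


@dataclass
class Workflow:
    """An ordered list of steps plus a provenance trace of the last run."""
    steps: list[Step]
    provenance: list[dict] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        self.provenance.clear()
        for step in self.steps:
            result = step.func(data)
            # Record what went in and what came out of every step.
            self.provenance.append(
                {"step": step.name, "input": data, "output": result}
            )
            data = result
        return data


workflow = Workflow(steps=[
    Step("normalise", lambda xs: [x / max(xs) for x in xs]),
    Step("filter", lambda xs: [x for x in xs if x >= 0.5]),
    Step("count", len),
])

result = workflow.run([2.0, 8.0, 5.0, 1.0])
for record in workflow.provenance:
    print(record["step"], "->", record["output"])
```

Keeping the trace next to the steps is what later makes it possible to compare a new run against an earlier one when diagnosing decay.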
  • Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand. First-class data citizens
  • Questions for Scientific Data and Workflows (each question is followed by the issue it raises)
    » Who are you? Identity and Description
    » Where and when were you born? Authenticity
    » Who were your parents (creators)? Uniqueness
    » For which purpose were you conceived and have you been used? Reuse, Repurpose
    » What do you have inside? Inspection, Visualization, Annotations
    » How is your content linked? Graphical Representation
    » May I access all your parts? Access Rights
    » Which parts can I replace? Adaptability
    » What have they done to you? Provenance
    » Who and when? Versioning
    » Why did they do that?
    » Why have you been recommended to me? Information Quality
    » Can I believe what you are saying or trust your results?
    » Do you still produce the same results? Reproducibility
    » Are you still working? Completeness
    » How could I repair you? Stability
    » How could I thank you? Credit
    » How could I talk about you?
  • Challenge 1: Identifying and structuring the relevant data. Research Objects as Technical Objects: Carriers of Research Context
    » Referentiable
    » Aggregation
      › Heterogeneous
      › Local and External
    » Annotated metadata
      › Provenance
      › Structured: Manifests, Recipes, Permissions, Discourse
    » Lifecycle
      › Publishing, Evolution
      › Versioning
    » Mixed Stewardship
      › Graceful Degradation
    » Sharing
    » Security & Privacy
    » Stereotypical User Profiles
    » Services
    [Figure labels: Third Party, Alien, Distributed, Tenancy, Store, Dispersed, Technical Objects, Social Objects, OAI-ORE]
  • Research Objects as Social Objects: Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit
  • Research Object model core (simplified)
    Namespace: http://purl.org/wf4ever/ro#
    RO specification: http://wf4ever.github.com/ro
    [Figure terms: ro:ResearchObject, ore:aggregates, ro:Resource, ore:isDescribedBy, ro:Manifest, wfdesc:Workflow, ro:annotatesAggregatedResource, ro:AggregatedAnnotation]
    Note: This figure shows a simplified view of the RO core.
    › ro (aggregation and annotation)
    › wfdesc (workflow description)
    › Minim* (minimum info model)
    › wfprov (workflow provenance)
    › roprov (RO provenance)
    › roevo (evolution model)
    *Minim based on M. Gamble's MIM
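The terms named on this slide can be combined into a tiny example graph. The sketch below uses plain Python tuples rather than an RDF library so that it stays self-contained. The `ro` namespace is the one given on the slide and the `ore` namespace is the standard OAI-ORE terms namespace; the `wfdesc` namespace URI, the example.org resource URIs and the file names are assumptions made for illustration, and the graph reflects only the simplified view of the core, not the full specification.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RO = "http://purl.org/wf4ever/ro#"               # from the slide
ORE = "http://www.openarchives.org/ore/terms/"   # OAI-ORE terms
WFDESC = "http://purl.org/wf4ever/wfdesc#"       # assumed namespace URI

# Hypothetical resource URIs for one Research Object.
BASE = "http://example.org/ro/expression-study/"
research_object = BASE
manifest = BASE + ".ro/manifest.rdf"
workflow = BASE + "workflow.t2flow"
annotation = BASE + ".ro/annotations/1"

triples = [
    (research_object, RDF_TYPE, RO + "ResearchObject"),
    (research_object, ORE + "isDescribedBy", manifest),
    (manifest, RDF_TYPE, RO + "Manifest"),
    # The RO aggregates a workflow, which is also typed as a wfdesc workflow.
    (research_object, ORE + "aggregates", workflow),
    (workflow, RDF_TYPE, RO + "Resource"),
    (workflow, RDF_TYPE, WFDESC + "Workflow"),
    # The RO aggregates an annotation that targets the workflow.
    (research_object, ORE + "aggregates", annotation),
    (annotation, RDF_TYPE, RO + "AggregatedAnnotation"),
    (annotation, RO + "annotatesAggregatedResource", workflow),
]


def to_ntriples(statements):
    """Serialise URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


def aggregated_resources(statements, ro_uri):
    """List everything the given Research Object aggregates."""
    return [o for s, p, o in statements
            if s == ro_uri and p == ORE + "aggregates"]


print(to_ntriples(triples))
print(aggregated_resources(triples, research_object))
```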
  • Challenge 2: Managing the lifecycle of data entities. Evolution and Decay
  • Challenge 2: Managing the lifecycle of data entities. RO Evolution & Versioning
  • Challenge 2: Managing the lifecycle of data entities. RO Decay
    » Workflow Decay
      › Component level
      › flux/decay/unavailability
      › Data level
      › Infrastructure level
    » Experiment Decay
      › Methodological changes
      › New technologies
      › New resources/components
      › New data
  • Preservation, Conservation, Recreating
    » Preserving: Archived Record, Fixed Snapshots, Review, Rerun & Replay
    » Conserving: Active Instrument, Live, Rerun & Reuse, Repair & Restore
    » Recreating: Archived Record, Active Instrument, Live, Rebuild, Recycle, Repurpose
  • Challenge 2: Managing the lifecycle of data entities. Possible types of decay (an example)
  • Decay Analysis: a taxonomy of RO decay
    1. Service tool is missing
    2. Service file descriptor disappeared
    3. Service up but not contactable
    4. Service up but functionality changed
    5. Local software dependencies
    6. Data unavailability
    7. Changes in data formats
    8. Chained dependency
    9. Credentials deprecated
    10. Input data superseded by other data
    11. RO metadata outdated (upon versioning)
    12. Old fashioned RO
    13. External references lose credit
    14. Execution framework no longer available
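The taxonomy above lends itself to an enumeration that diagnosis tools can report against. The sketch below encodes the fourteen types and shows one possible rule that maps the outcome of probing a service to decay types 2, 3 and 4. The `ServiceProbe` structure and the order of the checks are assumptions for illustration, not the diagnosis method presented in the talk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DecayType(Enum):
    """The fourteen decay types, numbered as in the taxonomy."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABLE = 6
    DATA_FORMAT_CHANGED = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


@dataclass
class ServiceProbe:
    """Hypothetical result of probing one service used by a workflow."""
    descriptor_found: bool
    reachable: bool
    output_matches_provenance: bool


def diagnose_service(probe: ServiceProbe) -> Optional[DecayType]:
    """Return the first decay type the probe points to, or None if healthy."""
    if not probe.descriptor_found:
        return DecayType.SERVICE_DESCRIPTOR_DISAPPEARED
    if not probe.reachable:
        return DecayType.SERVICE_NOT_CONTACTABLE
    if not probe.output_matches_provenance:
        # The service answers, but not as recorded in earlier runs.
        return DecayType.SERVICE_FUNCTIONALITY_CHANGED
    return None


probe = ServiceProbe(descriptor_found=True, reachable=True,
                     output_matches_provenance=False)
print(diagnose_service(probe))
```

The last check is where provenance comes in: comparing a fresh output with the output recorded for an earlier run is one way to notice that a service is up but has changed its behaviour.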
  • A taxonomy of workflow decay: sample decay type
  • Decay Analysis: 1.0 Certificate – Evaluation of Stability and Completeness
    1.0 Certificate of quality
    » Stability: Is the RO free from any form of decay preventing workflow execution?
      › Focus on reproducibility
      › Assisted detection of RO decay
      › Active monitoring on decay forms
      › RO and workflow provenance
      › RO evolution
      › Notification
      › Explanation
    » Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent?
      › RO checklists
      › Produced by scientists
      › Automatically checked against minimal model (minim)
    1.0 Certificate notion originally proposed by Yde de Jong
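The completeness side of the certificate, a checklist written by scientists and checked automatically against what the RO aggregates, can be sketched as follows. The `Requirement` structure, the resource kinds and the checklist items are hypothetical and are not the Minim vocabulary itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Requirement:
    """One checklist item: a kind of resource the RO should aggregate."""
    description: str
    resource_kind: str
    mandatory: bool = True


def evaluate_completeness(aggregated_kinds: Iterable[str],
                          checklist: list[Requirement]) -> dict:
    """Compare the kinds of resources in an RO against a checklist."""
    present = set(aggregated_kinds)
    missing = [r.description for r in checklist
               if r.mandatory and r.resource_kind not in present]
    advisories = [r.description for r in checklist
                  if not r.mandatory and r.resource_kind not in present]
    return {"complete": not missing,
            "missing": missing,
            "advisories": advisories}


# Hypothetical checklist for a publishable RO.
checklist = [
    Requirement("A workflow definition is aggregated", "workflow"),
    Requirement("Input data is aggregated", "input_data"),
    Requirement("A hypothesis is stated", "hypothesis"),
    Requirement("A workflow run is recorded", "workflow_run", mandatory=False),
]

report = evaluate_completeness({"workflow", "hypothesis"}, checklist)
print(report)
```

Separating mandatory items from advisory ones lets the same check produce both a pass/fail verdict and an explanation of what is missing.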
  • Recap: lessons learnt
    » Data with a Purpose
    » Encapsulate & Conquer
      › Goal-driven (purpose)
      › Aggregation
      › Community-managed
    » Nothing is immutable, especially data.
      › Foster evolution
      › Monitor decay
    [Side labels: Scalability, Provenance]
  • Thanks for your attention! Any questions? http://www.wf4ever-project.org/