Scientific data management from the lab to the web

The digital universe is booming, especially metadata and user-generated data. This raises serious challenges: identifying the portions of data that are relevant to a particular problem, and dealing with the lifecycle of that data. Finer-grained problems include data evolution and the potential impact of change on the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Object concept as a means to identify and structure relevant data in scientific domains, treating data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.

  1. Scientific Data Management: From the Lab to the Web. José Manuel Gómez Pérez, iSOCO. Semantic Data Management, Dagstuhl Seminar, 22-27 April 2012. www.wf4ever-project.org
  2. The data deluge: some facts » In 2010 the size of the digital universe exceeded 1 zettabyte (= 1 trillion GB) » 1.8 ZB in 2011 » 35 ZB expected in 2020 » 90% unstructured data » 70% user-generated » 75% resulting from data copying, merging, and transforming » Metadata is the fastest-growing data category » Much of this data is dynamic, real-time, volatile. Source: IDC's The 2011 Digital Universe Study – Extracting Value from Chaos
  3. Dealing with dynamicity: two main challenges » Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand › First-class data citizens » Challenge 2: Managing the lifecycle of data entities › Preservation › Evolution and versioning › Decay. Both technical and social aspects are involved.
  4. The Research Lifecycle: Workflows in the Scientific Method. [Figure: research lifecycle diagram; labels: Background, Hypothesis, Assumptions, Input data, Method, Scientific Experiment, Results (data), Interpretation, Publication] Example: Genome-Wide Association Studies
  5. Workflow-based Science: What is a Scientific Workflow? » A mechanism for coordinating the execution of services and linking together resources. » The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving. Scientific workflows are at the core of scientific data management › Enable automation › Encourage best practices
  6. Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand. First-class data citizens
  7. Questions for Scientific Data and Workflows, with the issues they raise. Who are you? Identity and Description. Where and when were you born? Authenticity. Who were your parents (creators)? Uniqueness. For which purpose were you conceived and have been used? Reuse, Repurpose. What do you have inside? Inspection, Visualization, Annotations. How is your content linked? Graphical Representation. May I access all your parts? Access Rights. Which parts can I replace? Adaptability. What have they done to you? Provenance. Who and when? Versioning. Why did they do that? Why have you been recommended to me? Information Quality. Can I believe what you are saying or trust your results? Do you still produce the same results? Reproducibility. Are you still working? Completeness. How could I repair you? Stability. How could I thank you? Credit. How could I talk about you?
  8. Challenge 1: Identifying and structuring the relevant data. Research Objects as Technical Objects: Carriers of Research Context » Referentiable » Aggregation › Heterogeneous › Local and External » Annotated metadata › Provenance › Structured: Manifests, Recipes, Permissions, Discourse » Lifecycle › Publishing, Evolution › Versioning » Mixed Stewardship › Graceful Degradation » Sharing » Security & Privacy » Stereotypical User Profiles » Services. [Figure labels: Third Party, Alien, Distributed, Tenancy, Store, Dispersed, Technical Objects, Social Objects, OAI-ORE]
  9. Research Objects as Social Objects: Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit
  10. Research Object model core (simplified). Namespace: http://purl.org/wf4ever/ro# RO specification: http://wf4ever.github.com/ro [Figure labels: ro:ResearchObject, ore:aggregates, ro:Resource, ore:isDescribedBy, ro:Manifest, wfdesc:Workflow, ro:annotatesAggregatedResource, ro:AggregatedAnnotation] Models: › ro (aggregation and annotation) › wfdesc (workflow description) › Minim* (minimum info model) › wfprov (workflow provenance) › roprov (RO provenance) › roevo (evolution model). Note: this figure shows a simplified view of the RO core. *Minim is based on M. Gamble's MIM.
  11. Challenge 2: Managing the lifecycle of data entities. Evolution and Decay
  12. Challenge 2: Managing the lifecycle of data entities. RO Evolution & Versioning
  13. Challenge 2: Managing the lifecycle of data entities. RO Decay. Workflow Decay: • Component level • flux/decay/unavailability • Data level • Infrastructure level. Experiment Decay: • Methodological changes • New technologies • New resources/components • New data
  14. Preservation, Conservation, Recreating. Preserving: Archived Record, Fixed Snapshots, Review, Rerun & Replay. Conserving: Active Instrument, Live, Rerun & Reuse, Repair & Restore. Recreating: Archived Record, Active Instrument, Live, Rebuild, Recycle, Repurpose
  15. Challenge 2: Managing the lifecycle of data entities. Possible types of decay (an example)
  16. Decay Analysis: A Taxonomy of RO decay. 1. Service tool is missing; 2. Service file descriptor disappeared; 3. Service up but not contactable; 4. Service up but functionality changed; 5. Local software dependencies; 6. Data unavailability; 7. Changes in data formats; 8. Chained dependency; 9. Credentials deprecated; 10. Input data superseded by other data; 11. RO metadata outdated (upon versioning); 12. Old-fashioned RO; 13. External references lose credit; 14. Execution framework no longer available
  17. A taxonomy of workflow decay: sample decay type
  18. Decay Analysis: 1.0 Certificate: Evaluation of Stability and Completeness. 1.0 Certificate of quality. Stability: Is the RO free from any form of decay preventing workflow execution? » Focus on reproducibility » Assisted detection of RO decay » Active monitoring on decay forms » RO and workflow provenance » RO evolution. Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent? » RO checklists » Produced by scientists » Automatically checked against minimal model (minim). Also listed: » Notification » Explanation. The 1.0 Certificate notion was originally proposed by Yde de Jong.
  19. Recap: lessons learnt » Data with a Purpose » Encapsulate & Conquer › Goal-driven (purpose) › Aggregation › Community-managed » Nothing is immutable, especially data. › Foster evolution › Monitor decay. [Figure labels: Scalability, Provenance]
  20. Thanks for your attention! Any questions? http://www.wf4ever-project.org/
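
The stability and completeness questions from slide 18 can be illustrated with a small Python sketch. It is not the Wf4Ever tooling: the class and function names, the subset of decay types chosen from slide 16, and the decision about which decay types block workflow execution are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decay(Enum):
    """A few decay types from the slide 16 taxonomy (member names are ours)."""
    SERVICE_MISSING = "service tool is missing"
    SERVICE_UNREACHABLE = "service up but not contactable"
    DATA_UNAVAILABLE = "data unavailability"
    CREDENTIALS_DEPRECATED = "credentials deprecated"
    METADATA_OUTDATED = "RO metadata outdated"


# Assumption: these decay types prevent workflow execution; the rest do not.
BLOCKING = {
    Decay.SERVICE_MISSING,
    Decay.SERVICE_UNREACHABLE,
    Decay.DATA_UNAVAILABLE,
    Decay.CREDENTIALS_DEPRECATED,
}


@dataclass
class ResearchObject:
    """Hypothetical, minimal stand-in for a Research Object."""
    name: str
    resources: set = field(default_factory=set)       # kinds of aggregated resources
    observed_decay: set = field(default_factory=set)  # Decay members detected so far


def is_complete(ro, checklist):
    """Completeness: does the RO aggregate everything the checklist requires?"""
    missing = set(checklist) - ro.resources
    return (not missing, missing)


def is_stable(ro):
    """Stability: is the RO free from decay that prevents workflow execution?"""
    blocking = ro.observed_decay & BLOCKING
    return (not blocking, blocking)


if __name__ == "__main__":
    ro = ResearchObject(
        name="gwas-study",
        resources={"workflow", "input data", "hypothesis"},
        observed_decay={Decay.METADATA_OUTDATED},
    )
    checklist = {"workflow", "input data", "hypothesis", "results"}
    print(is_complete(ro, checklist))
    print(is_stable(ro))
```

In this sketch the checklist plays the role of the minimal model (minim) that scientists produce, and the set of observed decay types stands in for whatever monitoring detects. Returning the missing or blocking items alongside the boolean leaves room for the notification and explanation steps the slide mentions.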

Editor's Notes

  • In this scenario, student Dennis has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet-laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected. The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter those genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, in which each gene is coupled to the pathways it has been implicated in by other experiments (there is a database for that: KEGG).
    Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment for keywords. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.
    Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It leads to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to his experiment and the processed data. Dennis notes that in a next cycle he will want to run another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part of his workflow with the global test workflow. In his log he records this fact. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.
    Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication has cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.
    Dennis's Working Research Object will contain:
      • A reference to the source of the data and the people to acknowledge for it
      • The initial hypothesis
      • The conceptual workflow or a summary of the experiment plan
      • References to workflows that were tested, with comments on their application to Dennis's case
      • A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
      • Dennis's workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
      • Dennis's workflow run, results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
      • The final hypothesis, with comments
      • A reference to the results of the workflow
      • A Design log that records Dennis's considerations while making the workflow
      • A Run log that records Dennis's considerations while running and interpreting the workflow
    His Publication Research Object will contain:
      • The workflow
      • A caption for his workflow (filtered from his design and run log: all information necessary for a reviewer to run the experiment)
      • A workflow run (results, and a caption filtered from the run log)
      • His initial hypothesis
      • His final hypothesis
      • The data source
      • Acknowledgements
    In time, Dennis's workflow can be found on the basis of the metadata of his Published and Working ROs. This will create a rich and wide range of search capabilities for Dennis's successors. The Working RO is kept at Dennis's local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
  • Emphasise the use of Linked Data. Note: the figures here are not intended to be readable. They're simply emphasising the existence of the models. Example user requirements being addressed by RO: UR1.3, aggregate existing resources to conveniently access related resources from a single place; UR1.6, describe the relationships between aggregated resources so that other researchers can see how the resources fit together; UR1.16, annotate experimental results using semantic models so that I can find/show links to other, relevant research objects.
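
The aggregation and annotation requirements in the note above (UR1.3, UR1.6, UR1.16) can be sketched as plain subject-predicate-object triples, using the terms that appear on the RO model slide. This is a hedged illustration rather than the Wf4Ever implementation: the helper functions, the example.org URIs, and the choice to have the RO aggregate its own annotations are assumptions, and annotation bodies are left out.

```python
# Prefixes for the terms shown on the RO model slide. The ro namespace is the
# one given on that slide; ore and rdf are the standard OAI-ORE and RDF namespaces.
PREFIXES = {
    "ro": "http://purl.org/wf4ever/ro#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(term):
    """Expand a prefixed name such as 'ore:aggregates'; leave other strings as-is."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


def build_ro(ro_uri, resource_uris, annotations):
    """Return triples for a minimal Research Object (hypothetical helper).

    annotations maps an annotation URI to the aggregated resource it targets.
    """
    manifest = ro_uri + "manifest"  # assumed location of the manifest
    triples = [
        (ro_uri, "rdf:type", "ro:ResearchObject"),
        (ro_uri, "ore:isDescribedBy", manifest),
        (manifest, "rdf:type", "ro:Manifest"),
    ]
    for res in resource_uris:
        triples.append((ro_uri, "ore:aggregates", res))
        triples.append((res, "rdf:type", "ro:Resource"))
    for ann, target in annotations.items():
        if target not in resource_uris:
            raise ValueError("annotation target is not aggregated: " + target)
        # Assumption: the RO aggregates its annotations alongside its resources.
        triples.append((ro_uri, "ore:aggregates", ann))
        triples.append((ann, "rdf:type", "ro:AggregatedAnnotation"))
        triples.append((ann, "ro:annotatesAggregatedResource", target))
    return triples


def aggregated(triples, ro_uri):
    """List everything the RO aggregates, i.e. the single place to find related resources."""
    return [o for s, p, o in triples if s == ro_uri and p == "ore:aggregates"]


if __name__ == "__main__":
    ro = "http://example.org/ro/1/"  # hypothetical URIs throughout
    workflow = ro + "workflow.t2flow"
    data = ro + "input.csv"
    note = ro + "annotations/1"
    for s, p, o in build_ro(ro, [workflow, data], {note: workflow}):
        print(expand(s), expand(p), expand(o))
```

Keeping the triples as plain tuples avoids depending on an RDF library; in practice the same statements would be serialised in the RO manifest.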