
2016 05-20-clariah-wp4


Progress of CLARIAH workpackage 4

Published in: Education


  1. WP4: the structured datahub: linked data, big and small
     Auke Rijpma, a.rijpma@uu.nl
     May 20, 2016
  2. Today
     ▶ Data problem in ES(D)H
     ▶ Linked-data solution
     ▶ Demos: interacting with the triplestore
  3. Problems to solve
     ▶ Economic, social, and demographic historians have a long tradition of data-intensive research.
     ▶ Two issues:
        1. As databases grow bigger and more complex, working with them becomes more difficult. Examples: HSN, CamPop, NAPP.
        2. Many small-to-medium-sized datasets are isolated (they exist on one computer or in one repository, or are not harmonised with other datasets): the “long tail” of research data.
     ▶ It is difficult to describe, share, and replicate research results.
     ▶ Equally important: it is difficult to answer questions that span more than one dataset (comparative or multilevel research).
  4. Disconnected data!
  5. Disconnected data: data preparation
     [Slide image: first page of Garijo, Alper, Belhajjame, Corcho, Gil, and Goble, “Common Motifs in Scientific Workflows: An Empirical Analysis”, shown together with its Fig. 3 (Distribution of Data-Oriented Motifs per domain) and Fig. 5 (Data Preparation Motifs in the Genomics Workflows).]
  6. The team
     Rinke Hoekstra, Kathrin Dentler, Albert Meroño Peñuela (VU), Laurens Rietveld (Triply), Richard Zijdeman, Ashkan Ashkpour (IISH).
  7. Linked data as a solution
     ▶ To solve this, we use web-based linked data technology.
     ▶ Linked data is a method of publishing data on the web so that it can be interlinked and given semantic meaning.
     + : Very flexible; sidesteps harmonisation; very expressive query language (SPARQL); cross-database queries (even across different servers); live querying; browseable database; ability to combine metadata and “codebook” with the actual data.
     – : Not optimised for very big databases (10m+ observations → 100m triples); unfamiliar technology.
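The "10m+ observations → 100m triples" ratio above suggests roughly ten triples per observation, which is what one would get by storing one triple per recorded variable. The sketch below illustrates that reading with a single made-up observation. The `http://example.org/` namespace, the variable names, and the helper functions are all hypothetical and are not the project's actual URI scheme or tooling.

```python
# Hypothetical namespace, for illustration only (not a CLARIAH URI scheme).
BASE = "http://example.org/"


def row_to_triples(dataset, row_id, row):
    """Turn one tabular observation into (subject, predicate, object) triples."""
    subject = f"{BASE}{dataset}/observation/{row_id}"
    return [
        (subject, f"{BASE}{dataset}/variable/{name}", value)
        for name, value in row.items()
        if value is not None  # missing values produce no triple
    ]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# One made-up census-style observation with ten variables.
row = {
    "age": 34, "sex": "f", "occupation": "weaver", "birthplace": "Utrecht",
    "marital_status": "married", "household_size": 5, "year": 1891,
    "religion": "unknown", "literacy": "yes", "relation_to_head": "wife",
}

triples = row_to_triples("census", 1, row)
print(len(triples))  # one triple per non-missing variable
print(match(triples, p=f"{BASE}census/variable/age"))
```

Because every value is addressed by a predicate URI, two datasets that reuse the same predicate can be queried together without first being merged into one table.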
  8. Plan
     ▶ Offer updated versions of important databases for economic and social historians (HSN, Clio-Infra, Campop, Mosaic, Henry-Fleury, CMGPD, Opgaafrollen, etc.) in one place, as linked data, in an accessible way.
     ▶ Allow users to upload and share their datasets.
     ▶ Allow and encourage users to link datasets to other datasets or important standards to grow a graph of connected datasets.
     ▶ Provide direct, browseable, queryable access and tooling for visualisation and analysis.
     Empower Individual Researchers
     • Augment and link individual datasets according to community best practices or against colleagues' datasets
     • Share machine-interpretable code books with fellow researchers
     • Align codes and identifiers across datasets
     • Publish standards-compliant, reusable datasets
     Grow a giant graph of interconnected datasets
  9. Demos
     Demonstrate the triplestore and the tools to interact with it.
     QBer: http://qber.clariah-sdh.eculture.labs.vu.nl
     Brwsr: http://data.clariah-sdh.eculture.labs.vu.nl/doc/resource/napp/observation/canada1891/62489
     YASGUI: http://virtuoso.clariah-sdh.eculture.labs.vu.nl/yasgui-auth/
     Grlc: http://grlc.clariah-sdh.eculture.labs.vu.nl/CLARIAH/wp4-queries/api-docs
  10. Demos
      ▶ QBer: “Connect your data to the cloud” (Rinke)
         1. Augment and link datasets according to best practice.
         2. Align codes and identifiers across datasets.
         3. Share machine-readable codebooks.
         4. Publish standardised datasets.
      ▶ http://qber.clariah-sdh.eculture.labs.vu.nl
      ▶ http://inspector.clariah-sdh.eculture.labs.vu.nl
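As a rough illustration of what aligning codes across datasets can mean in practice, the sketch below maps the local occupation codes of two made-up datasets onto shared concept URIs through per-dataset codebooks, after which the records can be counted together. It does not show how QBer works internally; the URIs, codes, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical shared vocabulary, for illustration only.
SHARED = "http://example.org/concept/occupation/"

# Each dataset keeps its own local codes; the codebook links them to shared URIs.
codebook_a = {"1": SHARED + "farmer", "2": SHARED + "weaver"}
codebook_b = {"boer": SHARED + "farmer", "wever": SHARED + "weaver"}

dataset_a = [{"id": "a1", "occupation": "1"}, {"id": "a2", "occupation": "2"}]
dataset_b = [{"id": "b1", "occupation": "boer"}, {"id": "b2", "occupation": "smid"}]


def align(records, codebook, field):
    """Replace a local code with its shared URI; unmapped codes become None."""
    return [{**record, field: codebook.get(record[field])} for record in records]


aligned = align(dataset_a, codebook_a, "occupation") + align(dataset_b, codebook_b, "occupation")

# Once both datasets use the same identifiers, they can be analysed as one.
counts = Counter(r["occupation"] for r in aligned if r["occupation"] is not None)
unmapped = [r["id"] for r in aligned if r["occupation"] is None]

print(counts)
print(unmapped)  # records whose local code still needs a codebook entry
```

Keeping the codebook separate from the data means it can be shared and corrected without touching the original records.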
  11. Demos
      ▶ Brwsr: Lightweight linked data browser (Rinke).
         1. Browse linked data as if it were a web page.
      ▶ http://data.clariah-sdh.eculture.labs.vu.nl/doc/resource/napp/observation/canada1891/62489
  12. Demos
      ▶ Querying: Virtuoso, SPARQL, and YASGUI (Kathrin, Laurens)
         1. Virtuoso triplestore, currently with 1b+ triples (1 188 852 440).
         2. SPARQL as expressive query language.
         3. YASGUI as feature-rich editor.
      ▶ http://virtuoso.clariah-sdh.eculture.labs.vu.nl/yasgui-auth/
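Outside an editor such as YASGUI, a SPARQL query is typically sent to an endpoint over HTTP using the standard SPARQL protocol. The sketch below only builds such a request and does not send it. The endpoint URL is a placeholder assumption (the slides give the YASGUI address, not a public endpoint), and the query is a generic "first ten triples" example rather than one of the project's queries.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: placeholder endpoint; substitute the address of a real SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# Generic example query: return any ten triples in the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# To execute against a real endpoint (not done here):
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
```

Requesting `application/sparql-results+json` keeps the response easy to parse with the standard library.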
  13. Demos
      ▶ Grlc: git repository linked data API constructor (Albert).
         1. Builds Web APIs using SPARQL queries stored in git repositories.
         2. Store and share queries.
         3. Parametrise queries to overcome unfamiliarity with SPARQL.
      ▶ http://grlc.clariah-sdh.eculture.labs.vu.nl/CLARIAH/wp4-queries/api-docs
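The idea of a parametrised query can be sketched as a stored query with placeholders that are filled from user-supplied values, so that the user never edits the query itself. The code below is a simplified stand-in and not Grlc's implementation. The `?_name` placeholder style, the `http://example.org/` predicates, and the function names are assumptions made for illustration.

```python
import re

# A stored query with one placeholder (?_country). Predicates are hypothetical.
STORED_QUERY = """
SELECT ?person ?age
WHERE {
  ?person <http://example.org/country> ?_country ;
          <http://example.org/age> ?age .
}
"""


def as_literal(value):
    """Quote a value as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def fill_parameters(query, params):
    """Replace every ?_name placeholder with the matching value from params."""
    def substitute(found):
        name = found.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return as_literal(params[name])

    return re.sub(r"\?_(\w+)", substitute, query)


filled = fill_parameters(STORED_QUERY, {"country": "canada"})
print(filled)
```

A web API built on top of this would expose `country` as a request parameter and run the filled-in query on the user's behalf.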
  14. Demos: Identify locally, extrapolate globally…?

                      canada       sweden
      (Intercept)     3.616***     4.430***
                      (0.134)      (0.033)
      log(gdppc)      0.036**     -0.070***
                      (0.018)      (0.004)
      I(age^2)       -0.000***    -0.000***
                      (0.000)      (0.000)
      age             0.007***     0.001***
                      (0.000)      (0.000)
      R2              0.013        0.021
      Adj. R2         0.012        0.021
      Num. obs.       36201        275127
      RMSE            0.142        0.102

      [Four scatter plots: log(hiscam) against age and against log(gdppc), for Canada and for Sweden.]
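Judging by the plot axes, the two models above appear to regress log(hiscam) on log(gdppc), age, and age squared, fitted separately per country. The sketch below shows that specification with a small ordinary least squares routine. The data are synthetic and noise-free, generated from made-up coefficients, so the output says nothing about the results in the table; the estimation method used for the slide is also not stated and plain OLS is an assumption.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def design_row(gdppc, age):
    """Regressors in the table's order: intercept, log(gdppc), age^2, age."""
    return [1.0, math.log(gdppc), float(age ** 2), float(age)]


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: made-up coefficients and made-up GDP levels, no noise.
TRUE = [4.0, 0.05, -0.0001, 0.01]
rows = [design_row(g, a) for a in range(20, 75, 5) for g in (900.0, 1800.0)]
log_hiscam = [sum(b * x for b, x in zip(TRUE, r)) for r in rows]

beta = ols(rows, log_hiscam)
fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
print([round(b, 4) for b in beta])
```

Running the same function once per country, on data retrieved from the triplestore, would give one column of coefficients each, as in the table.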
  15. Future work
      ▶ Continue working with researchers (Ivo Zandhuis, Ruben Schalk) on use cases, and spread the gospel.
      ▶ Increase data volume.
      ▶ Make sure the triplestore can handle the data volume.
      ▶ Make all components work together.
      ▶ Make a good (appealing and accessible) interface.
      More info:
      ▶ http://datalegend.net
      ▶ https://github.com/clariah
      ▶ See(n) us at conferences: ESSHC, Posthumus, WHiSe, …
