SlideShare a Scribd company logo
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
WP4: the structured datahub: linked data, big and
small
Auke Rijpma, a.rijpma@uu.nl
May 20, 2016
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Today
▶ Data problem in ES(D)H
▶ Linked-data solution
▶ Demos interaction triplestore
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Problems to solve
▶ Economic, social, and demographic historians have a long tradition
of data-intensive research.
▶ Two issues:
1. As databases grow bigger and more complex, working with them
becomes more difficult. Examples: HSN, CamPop, NAPP.
2. There are many small-to-medium size datasets that are isolated (exist
on one computer, one repository, or are not harmonised with other
datasets): the “long tail” of research data.
▶ Difficult to describe, share, and replicate research results.
▶ Equally important: difficult to answer questions that span more than
one dataset (comparative or multilevel research).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Disconnected data
!
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Disconnected data
Data Preparation
Common Motifs in Scientific Workflows:
An Empirical Analysis
Daniel Garijo⇤, Pinar Alper †, Khalid Belhajjame†, Oscar Corcho⇤, Yolanda Gil‡, Carole Goble†
⇤Ontology Engineering Group, Universidad Polit´ecnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
†School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk
‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu
Abstract—While workflow technology has gained momentum
in the last decade as a means for specifying and enacting compu-
tational experiments in modern science, reusing and repurposing
existing workflows to build new scientific experiments is still a
daunting task. This is partly due to the difficulty that scientists
experience when attempting to understand existing workflows,
which contain several data preparation and adaptation steps in
addition to the scientifically significant analysis steps. One way
to tackle the understandability problem is through providing
abstractions that give a high-level view of activities undertaken
within workflows. As a first step towards abstractions, we report
in this paper on the results of a manual analysis performed over
a set of real-world scientific workflows from Taverna and Wings
systems. Our analysis has resulted in a set of scientific workflow
motifs that outline i) the kinds of data intensive activities that are
observed in workflows (data oriented motifs), and ii) the different
manners in which activities are implemented within workflows
(workflow oriented motifs). These motifs can be useful to inform
workflow designers on the good and bad practices for workflow
development, to inform the design of automated tools for the
generation of workflow abstractions, etc.
I. INTRODUCTION
Scientific workflows have been increasingly used in the last
decade as an instrument for data intensive scientific analysis.
In these settings, workflows serve a dual function: first as
detailed documentation of the method (i. e. the input sources
and processing steps taken for the derivation of a certain
data item) and second as re-usable, executable artifacts for
data-intensive analysis. Workflows stitch together a variety
of data manipulation activities such as data movement, data
transformation or data visualization to serve the goals of the
scientific study. The stitching is realized by the constructs
made available by the workflow system used and is largely
shaped by the environment in which the system operates and
the function undertaken by the workflow.
A variety of workflow systems are in use [10] [3] [7] [2]
serving several scientific disciplines. A workflow is a software
artifact, and as such once developed and tested, it can be
shared and exchanged between scientists. Other scientists can
then reuse existing workflows in their experiments, e.g., as
sub-workflows [17]. Workflow reuse presents several advan-
tages [4]. For example, it enables proper data citation and
[14] and CrowdLabs [8] have made publishing and finding
workflows easier, but scientists still face the challenges of re-
use, which amounts to fully understanding and exploiting the
available workflows/fragments. One difficulty in understanding
workflows is their complex nature. A workflow may contain
several scientifically-significant analysis steps, combined with
various other data preparation activities, and in different
implementation styles depending on the environment and
context in which the workflow is executed. The difficulty in
understanding causes workflow developers to revert to starting
from scratch rather than re-using existing fragments.
Through an analysis of the current practices in scientific
workflow development, we could gain insights on the creation
of understandable and more effectively re-usable workflows.
Specifically, we propose an analysis with the following objec-
tives:
1) To reverse-engineer the set of current practices in work-
flow development through an analysis of empirical evi-
dence.
2) To identify workflow abstractions that would facilitate
understandability and therefore effective re-use.
3) To detect potential information sources and heuristics
that can be used to inform the development of tools for
creating workflow abstractions.
In this paper we present the result of an empirical analysis
performed over 177 workflow descriptions from Taverna [10]
and Wings [3]. Based on this analysis, we propose a catalogue
of scientific workflow motifs. Motifs are provided through i)
a characterization of the kinds of data-oriented activities that
are carried out within workflows, which we refer to as data-
oriented motifs, and ii) a characterization of the different man-
ners in which those activity motifs are realized/implemented
within workflows, which we refer to as workflow-oriented
motifs. It is worth mentioning that, although important, motifs
that have to do with scheduling and mapping of workflows
onto distributed resources [12] are out the scope of this paper.
The paper is structured as follows. We begin by providing
related work in Section II, which is followed in Section III by
brief background information on Scientific Workflows, and the
Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data
Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data Preparation Motifs in the Genomics Workflows
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
The team
Rinke Hoekstra, Kathrin Dentler, Albert Meroño Peñuela (VU), Laurens
Rietveld (Triply), Richard Zijdeman, Ashkan Ashkpour (IISH).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Linked data as a solution
▶ To solve this, we use web-based linked data technology.
▶ Method of publishing data on the web so that it can be interlinked
and given semantic meaning.
+ : Very flexible, sidesteps harmonisations, very expressive query
language (SPARQL), cross-database queries (even on different
servers), live querying, browseable database, ability to combine
metadata and “codebook” with actual data.
– : not optimised for very big databases (10m+ observations → 100m
triples), unfamiliar technology.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Plan
▶ Offer updated versions of important databases for economic and
social historians (HSN, Clio-Infra, Campop, Mosaic, Henry-Fleury,
CMGPD, Opgaafrollen, etc.) in one place, as linked data, in an
accessible way:.
▶ Allow users to upload and share their datasets.
▶ Allow and encourage users to link datasets to other datasets or
important standards to grow a graph of connected datasets.
▶ Provide direct, browseable, queryable access and tooling for
visualisation and analysis.
Empower Individual Researchers
• Augment and link individual datasets according to best
practices of the community or against colleagues
• Share machine-interpretable code books with fellow
researchers
• Align codes and identifiers across datasets
• Publish standards-compliant, reusable datasets
Grow a giant graph of interconnected
datasets
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
Demonstrate triplestore and tools to interact with it.
QBer: http://qber.clariah-sdh.eculture.labs.vu.nl
Brwsr: http://data.clariah-sdh.eculture.labs.vu.nl/doc/
resource/napp/observation/canada1891/62489
YASGUI: http://virtuoso.clariah-sdh.eculture.labs.vu.nl/
yasgui-auth/
Grlc: http://grlc.clariah-sdh.eculture.labs.vu.nl/
CLARIAH/wp4-queries/api-docs
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
▶ QBer: “Connect your data to the cloud” (Rinke)
1. Augment and link datasets according to best practice.
2. Align codes and identifiers across datasets.
3. Share machine-readable codebooks.
4. Publish standardised datasets.
▶ http://qber.clariah-sdh.eculture.labs.vu.nl
▶ http://inspector.clariah-sdh.eculture.labs.vu.nl
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
▶ Brwsr: Lightweight linked data browser (Rinke).
1. Browse linked data as if it is a web page.
▶ http://data.clariah-sdh.eculture.labs.vu.nl/doc/
resource/napp/observation/canada1891/62489
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
▶ Querying: Virtuoso, SPARQL, and YASGUI (Kathrin, Laurens)
1. Virtuoso triplestore, currently with 1b+ triples (1 188 852 440)
2. SPARQL as expressive query language
3. YASGUI as feature-rich editor.
▶ http://virtuoso.clariah-sdh.eculture.labs.vu.nl/
yasgui-auth/
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
▶ Grlc: git repository linked data API constructor (Albert).
1. Builds Web APIs using SPARQL queries stored in git repositories.
2. Store and share queries.
3. Parametrise queries to overcome unfamiliarity with Grlc.
▶ http://grlc.clariah-sdh.eculture.labs.vu.nl/
CLARIAH/wp4-queries/api-docs
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demos
entify locally, extrapolate
obally…?
canada sweden
(Intercept) 3.616*** 4.430***
(0.134) (0.033)
log(gdppc) 0.036** -0.070***
(0.018) (0.004)
I(age^2) -0.000*** -0.000***
0.000 0.000
age 0.007*** 0.001***
0.000 0.000
R2 0.013 0.021
Adj. R2 0.012 0.021
Num. obs. 36201 275127
RMSE 0.142 0.102
●
●
●
●
●
●
●
●
●
●
●
●●
●
20 30 40 50 60 70
3.984.004.024.04
Canada
age
log(hiscam)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5
3.984.004.024.04
Canada
log(gdppc)
log(hiscam)
●
●
●
●●
●
●
●
●●●●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●●●●
●
●
●
●
●●●●
●
●
●
●●●
●
●●●
●
●
●
●
●
●●
●
●
20 30 40 50 60 70
3.903.943.984.02
Sweden
age
log(hiscam)
●
●●●
●
●●●
●●
●
●●●
●
●
●
●●●●
●
●●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
6.8 6.9 7.0 7.1 7.2 7.3
3.903.943.984.02
Sweden
log(gdppc)
log(hiscam)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Future work
▶ Continue working with researchers (Ivo Zandhuis, Ruben Schalk)
for use cases and spread the gospel.
▶ Increase data volume.
▶ Make sure triplestore can handle data volume.
▶ Make all components work together.
▶ Make a good (appealing and accessible) interface.
More info:
▶ http://datalegend.net
▶ https://github.com/clariah
▶ See(n) us at conferences: ESSHC, Posthumus, WHiSe, …

More Related Content

What's hot

WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
Micah Altman
 
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
Andrea Bollini
 

What's hot (20)

Esshc presentation ashkan
Esshc presentation ashkanEsshc presentation ashkan
Esshc presentation ashkan
 
SSHA 2019: Reconstructring a country
SSHA 2019: Reconstructring a countrySSHA 2019: Reconstructring a country
SSHA 2019: Reconstructring a country
 
Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized Publishing
 
WG5: A data wrangling experiment
WG5: A data wrangling experimentWG5: A data wrangling experiment
WG5: A data wrangling experiment
 
Citizen Science Open Data
Citizen Science Open DataCitizen Science Open Data
Citizen Science Open Data
 
Wednesday 6 May: Hand me the data! What you should know as a humanities resea...
Wednesday 6 May: Hand me the data! What you should know as a humanities resea...Wednesday 6 May: Hand me the data! What you should know as a humanities resea...
Wednesday 6 May: Hand me the data! What you should know as a humanities resea...
 
Research data discovery in OpenAIRE (Presentation by Paolo Manghi at DI4R2018)
Research data discovery in OpenAIRE (Presentation by Paolo Manghi at DI4R2018)Research data discovery in OpenAIRE (Presentation by Paolo Manghi at DI4R2018)
Research data discovery in OpenAIRE (Presentation by Paolo Manghi at DI4R2018)
 
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
 
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard JensenTuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
Tuesday 5 May: The Shapes of Archives and Memory, Helle Strandgaard Jensen
 
Observlets
Observlets Observlets
Observlets
 
Open University Data
Open University DataOpen University Data
Open University Data
 
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...DSpace for Cultural Heritage: adding support for images visualization,audio/v...
DSpace for Cultural Heritage: adding support for images visualization,audio/v...
 
Mind the gap! Reflections on the state of repository data harvesting
Mind the gap! Reflections on the state of repository data harvestingMind the gap! Reflections on the state of repository data harvesting
Mind the gap! Reflections on the state of repository data harvesting
 
Linked Data
Linked DataLinked Data
Linked Data
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)
 
20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph
 
Think like a Digital Curator
Think like a Digital CuratorThink like a Digital Curator
Think like a Digital Curator
 
The ARIADNE interoperability framework, component architecture and registry s...
The ARIADNE interoperability framework, component architecture and registry s...The ARIADNE interoperability framework, component architecture and registry s...
The ARIADNE interoperability framework, component architecture and registry s...
 
Web Archive Research Skills and Tools Survey (WARST)
 Web Archive Research Skills and Tools Survey (WARST) Web Archive Research Skills and Tools Survey (WARST)
Web Archive Research Skills and Tools Survey (WARST)
 
Connecting Heterogeneous Collections using Linked Data
Connecting Heterogeneous Collections using Linked DataConnecting Heterogeneous Collections using Linked Data
Connecting Heterogeneous Collections using Linked Data
 

Similar to 2016 05-20-clariah-wp4

Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
Editor IJCATR
 

Similar to 2016 05-20-clariah-wp4 (20)

Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
 
Linked Data: Een extra ontstluitingslaag op archieven
Linked Data: Een extra ontstluitingslaag op archieven Linked Data: Een extra ontstluitingslaag op archieven
Linked Data: Een extra ontstluitingslaag op archieven
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
Executable papers
Executable papersExecutable papers
Executable papers
 
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I...
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchange
 
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
 
A Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdfA Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdf
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
Software and Education at NSF/ACI
Software and Education at NSF/ACISoftware and Education at NSF/ACI
Software and Education at NSF/ACI
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
RM 4 UNIT.pptx
RM 4 UNIT.pptxRM 4 UNIT.pptx
RM 4 UNIT.pptx
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Linked Data Workshop Stanford University
Linked Data Workshop Stanford University Linked Data Workshop Stanford University
Linked Data Workshop Stanford University
 

More from CLARIAH

More from CLARIAH (20)

ACAD Presentation by Wilbert Spooren, CLARIAH Toogdag 19-10-2018
ACAD Presentation by Wilbert Spooren, CLARIAH Toogdag 19-10-2018ACAD Presentation by Wilbert Spooren, CLARIAH Toogdag 19-10-2018
ACAD Presentation by Wilbert Spooren, CLARIAH Toogdag 19-10-2018
 
DB:CCC Presentation of Karin Hofmeester, CLARIAH Toogdag 19-10-2018
DB:CCC Presentation of Karin Hofmeester, CLARIAH Toogdag 19-10-2018DB:CCC Presentation of Karin Hofmeester, CLARIAH Toogdag 19-10-2018
DB:CCC Presentation of Karin Hofmeester, CLARIAH Toogdag 19-10-2018
 
Masterclass innosurance 2018
Masterclass innosurance 2018Masterclass innosurance 2018
Masterclass innosurance 2018
 
Flat TLA
Flat TLAFlat TLA
Flat TLA
 
Collection registration for the CLARIAH Media Suite.
Collection registration for the CLARIAH Media Suite.Collection registration for the CLARIAH Media Suite.
Collection registration for the CLARIAH Media Suite.
 
CMDI2RDF
CMDI2RDFCMDI2RDF
CMDI2RDF
 
2016 05-20-clariah-wp3
2016 05-20-clariah-wp32016 05-20-clariah-wp3
2016 05-20-clariah-wp3
 
2016 05-20-clariah-wp2
2016 05-20-clariah-wp22016 05-20-clariah-wp2
2016 05-20-clariah-wp2
 
2016 05-20-clariah-wp5
2016 05-20-clariah-wp52016 05-20-clariah-wp5
2016 05-20-clariah-wp5
 
MTAS Henny Brugman
MTAS Henny BrugmanMTAS Henny Brugman
MTAS Henny Brugman
 
LREC Ton vd Wouden
LREC Ton vd WoudenLREC Ton vd Wouden
LREC Ton vd Wouden
 
Paqu Gertjan van Noord en Jan Odijk
Paqu Gertjan van Noord en Jan OdijkPaqu Gertjan van Noord en Jan Odijk
Paqu Gertjan van Noord en Jan Odijk
 
Open sonar martinreynaert
Open sonar martinreynaertOpen sonar martinreynaert
Open sonar martinreynaert
 
Struc data Auke Rijpma
Struc data Auke RijpmaStruc data Auke Rijpma
Struc data Auke Rijpma
 
Diachronous conceptuallexicons Marieke van Erp / Piek Vossen
Diachronous conceptuallexicons Marieke van Erp / Piek VossenDiachronous conceptuallexicons Marieke van Erp / Piek Vossen
Diachronous conceptuallexicons Marieke van Erp / Piek Vossen
 
Corpus studio Erwin Komen
Corpus studio Erwin KomenCorpus studio Erwin Komen
Corpus studio Erwin Komen
 
Athena richard zijdeman
Athena richard zijdemanAthena richard zijdeman
Athena richard zijdeman
 
Struc data aukerijpma
Struc data aukerijpmaStruc data aukerijpma
Struc data aukerijpma
 
Anansi jauco noordzij
Anansi jauco noordzijAnansi jauco noordzij
Anansi jauco noordzij
 
Clariah dag 2016_wp1_ocw
Clariah dag 2016_wp1_ocwClariah dag 2016_wp1_ocw
Clariah dag 2016_wp1_ocw
 

Recently uploaded

Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
Avinash Rai
 

Recently uploaded (20)

2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx
 
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringBasic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
 
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdfTelling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resources
 
Basic Civil Engineering Notes of Chapter-6, Topic- Ecosystem, Biodiversity G...
Basic Civil Engineering Notes of Chapter-6,  Topic- Ecosystem, Biodiversity G...Basic Civil Engineering Notes of Chapter-6,  Topic- Ecosystem, Biodiversity G...
Basic Civil Engineering Notes of Chapter-6, Topic- Ecosystem, Biodiversity G...
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptxMatatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
size separation d pharm 1st year pharmaceutics
size separation d pharm 1st year pharmaceuticssize separation d pharm 1st year pharmaceutics
size separation d pharm 1st year pharmaceutics
 
The Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryThe Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. Henry
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
B.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdfB.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdf
 
Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
 
[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 

2016 05-20-clariah-wp4

  • 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . WP4: the structured datahub: linked data, big and small Auke Rijpma, a.rijpma@uu.nl May 20, 2016
  • 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Today ▶ Data problem in ES(D)H ▶ Linked-data solution ▶ Demos interaction triplestore
  • 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems to solve ▶ Economic, social, and demographic historians have a long tradition of data-intensive research. ▶ Two issues: 1. As databases grow bigger and more complex, working with them becomes more difficult. Examples: HSN, CamPop, NAPP. 2. There are many small-to-medium size datasets that are isolated (exist on one computer, one repository, or are not harmonised with other datasets): the “long tail” of research data. ▶ Difficult to describe, share, and replicate research results. ▶ Equally important: difficult to answer questions that span more than one dataset (comparative or multilevel research).
  • 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Disconnected data Data Preparation Common Motifs in Scientific Workflows: An Empirical Analysis Daniel Garijo⇤, Pinar Alper †, Khalid Belhajjame†, Oscar Corcho⇤, Yolanda Gil‡, Carole Goble† ⇤Ontology Engineering Group, Universidad Polit´ecnica de Madrid. {dgarijo, ocorcho}@fi.upm.es †School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk ‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu Abstract—While workflow technology has gained momentum in the last decade as a means for specifying and enacting compu- tational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc. I. INTRODUCTION Scientific workflows have been increasingly used in the last decade as an instrument for data intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i. e. the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow. A variety of workflow systems are in use [10] [3] [7] [2] serving several scientific disciplines. A workflow is a software artifact, and as such once developed and tested, it can be shared and exchanged between scientists. Other scientists can then reuse existing workflows in their experiments, e.g., as sub-workflows [17]. Workflow reuse presents several advan- tages [4]. For example, it enables proper data citation and [14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re- use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments. Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objec- tives: 1) To reverse-engineer the set of current practices in work- flow development through an analysis of empirical evi- dence. 2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use. 3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions. In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data- oriented motifs, and ii) a characterization of the different man- ners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs that have to do with scheduling and mapping of workflows onto distributed resources [12] are out the scope of this paper. The paper is structured as follows. We begin by providing related work in Section II, which is followed in Section III by brief background information on Scientific Workflows, and the Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data Preparation Motifs in the Genomics Workflows
  • 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The team Rinke Hoekstra, Kathrin Dentler, Albert Meroño Peñuela (VU), Laurens Rietveld (Triply), Richard Zijdeman, Ashkan Ashkpour (IISH).
  • 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linked data as a solution ▶ To solve this, we use web-based linked data technology. ▶ Method of publishing data on the web so that it can be interlinked and given semantic meaning. + : Very flexible, sidesteps harmonisations, very expressive query language (SPARQL), cross-database queries (even on different servers), live querying, browseable database, ability to combine metadata and “codebook” with actual data. – : not optimised for very big databases (10m+ observations → 100m triples), unfamiliar technology.
  • 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Plan ▶ Offer updated versions of important databases for economic and social historians (HSN, Clio-Infra, Campop, Mosaic, Henry-Fleury, CMGPD, Opgaafrollen, etc.) in one place, as linked data, in an accessible way:. ▶ Allow users to upload and share their datasets. ▶ Allow and encourage users to link datasets to other datasets or important standards to grow a graph of connected datasets. ▶ Provide direct, browseable, queryable access and tooling for visualisation and analysis. Empower Individual Researchers • Augment and link individual datasets according to best practices of the community or against colleagues • Share machine-interpretable code books with fellow researchers • Align codes and identifiers across datasets • Publish standards-compliant, reusable datasets Grow a giant graph of interconnected datasets
  • 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos Demonstrate triplestore and tools to interact with it. QBer: http://qber.clariah-sdh.eculture.labs.vu.nl Brwsr: http://data.clariah-sdh.eculture.labs.vu.nl/doc/ resource/napp/observation/canada1891/62489 YASGUI: http://virtuoso.clariah-sdh.eculture.labs.vu.nl/ yasgui-auth/ Grlc: http://grlc.clariah-sdh.eculture.labs.vu.nl/ CLARIAH/wp4-queries/api-docs
  • 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos ▶ QBer: “Connect your data to the cloud” (Rinke) 1. Augment and link datasets according to best practice. 2. Align codes and identifiers across datasets. 3. Share machine-readable codebooks. 4. Publish standardised datasets. ▶ http://qber.clariah-sdh.eculture.labs.vu.nl ▶ http://inspector.clariah-sdh.eculture.labs.vu.nl
  • 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos ▶ Brwsr: Lightweight linked data browser (Rinke). 1. Browse linked data as if it is a web page. ▶ http://data.clariah-sdh.eculture.labs.vu.nl/doc/ resource/napp/observation/canada1891/62489
  • 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos ▶ Querying: Virtuoso, SPARQL, and YASGUI (Kathrin, Laurens) 1. Virtuoso triplestore, currently with 1b+ triples (1 188 852 440) 2. SPARQL as expressive query language 3. YASGUI as feature-rich editor. ▶ http://virtuoso.clariah-sdh.eculture.labs.vu.nl/ yasgui-auth/
  • 13. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos ▶ Grlc: git repository linked data API constructor (Albert). 1. Builds Web APIs using SPARQL queries stored in git repositories. 2. Store and share queries. 3. Parametrise queries to overcome unfamiliarity with Grlc. ▶ http://grlc.clariah-sdh.eculture.labs.vu.nl/ CLARIAH/wp4-queries/api-docs
  • 14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demos entify locally, extrapolate obally…? canada sweden (Intercept) 3.616*** 4.430*** (0.134) (0.033) log(gdppc) 0.036** -0.070*** (0.018) (0.004) I(age^2) -0.000*** -0.000*** 0.000 0.000 age 0.007*** 0.001*** 0.000 0.000 R2 0.013 0.021 Adj. R2 0.012 0.021 Num. obs. 36201 275127 RMSE 0.142 0.102 ● ● ● ● ● ● ● ● ● ● ● ●● ● 20 30 40 50 60 70 3.984.004.024.04 Canada age log(hiscam) ● ● ● ● ● ● ● ● ● ● ● ● ● ● 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 3.984.004.024.04 Canada log(gdppc) log(hiscam) ● ● ● ●● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ●●●● ● ● ● ● ●●●● ● ● ● ●●● ● ●●● ● ● ● ● ● ●● ● ● 20 30 40 50 60 70 3.903.943.984.02 Sweden age log(hiscam) ● ●●● ● ●●● ●● ● ●●● ● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 6.8 6.9 7.0 7.1 7.2 7.3 3.903.943.984.02 Sweden log(gdppc) log(hiscam)
  • 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future work ▶ Continue working with researchers (Ivo Zandhuis, Ruben Schalk) for use cases and spread the gospel. ▶ Increase data volume. ▶ Make sure triplestore can handle data volume. ▶ Make all components work together. ▶ Make a good (appealing and accessible) interface. More info: ▶ http://datalegend.net ▶ https://github.com/clariah ▶ See(n) us at conferences: ESSHC, Posthumus, WHiSe, …