SlideShare a Scribd company logo
1 of 24
Download to read offline
dataLegend
A CLARIAH work package
DH BeNeLux: Data Integration Challenges
Utrecht University
Utrecht, the Netherlands
July 3, 2017
@rlzijdeman #dhb2017
Data Preparation
Common Motifs in Scientific Workflows:
An Empirical Analysis
Daniel Garijo⇤, Pinar Alper †, Khalid Belhajjame†, Oscar Corcho⇤, Yolanda Gil‡, Carole Goble†
⇤Ontology Engineering Group, Universidad Polit´ecnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
†School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk
‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu
Abstract—While workflow technology has gained momentum
in the last decade as a means for specifying and enacting compu-
tational experiments in modern science, reusing and repurposing
existing workflows to build new scientific experiments is still a
daunting task. This is partly due to the difficulty that scientists
experience when attempting to understand existing workflows,
which contain several data preparation and adaptation steps in
addition to the scientifically significant analysis steps. One way
to tackle the understandability problem is through providing
abstractions that give a high-level view of activities undertaken
within workflows. As a first step towards abstractions, we report
in this paper on the results of a manual analysis performed over
a set of real-world scientific workflows from Taverna and Wings
systems. Our analysis has resulted in a set of scientific workflow
motifs that outline i) the kinds of data intensive activities that are
observed in workflows (data oriented motifs), and ii) the different
manners in which activities are implemented within workflows
(workflow oriented motifs). These motifs can be useful to inform
workflow designers on the good and bad practices for workflow
development, to inform the design of automated tools for the
generation of workflow abstractions, etc.
I. INTRODUCTION
Scientific workflows have been increasingly used in the last
decade as an instrument for data intensive scientific analysis.
In these settings, workflows serve a dual function: first as
detailed documentation of the method (i. e. the input sources
and processing steps taken for the derivation of a certain
data item) and second as re-usable, executable artifacts for
data-intensive analysis. Workflows stitch together a variety
of data manipulation activities such as data movement, data
transformation or data visualization to serve the goals of the
scientific study. The stitching is realized by the constructs
made available by the workflow system used and is largely
shaped by the environment in which the system operates and
the function undertaken by the workflow.
A variety of workflow systems are in use [10] [3] [7] [2]
serving several scientific disciplines. A workflow is a software
[14] and CrowdLabs [8] have made publishing and finding
workflows easier, but scientists still face the challenges of re-
use, which amounts to fully understanding and exploiting the
available workflows/fragments. One difficulty in understanding
workflows is their complex nature. A workflow may contain
several scientifically-significant analysis steps, combined with
various other data preparation activities, and in different
implementation styles depending on the environment and
context in which the workflow is executed. The difficulty in
understanding causes workflow developers to revert to starting
from scratch rather than re-using existing fragments.
Through an analysis of the current practices in scientific
workflow development, we could gain insights on the creation
of understandable and more effectively re-usable workflows.
Specifically, we propose an analysis with the following objec-
tives:
1) To reverse-engineer the set of current practices in work-
flow development through an analysis of empirical evi-
dence.
2) To identify workflow abstractions that would facilitate
understandability and therefore effective re-use.
3) To detect potential information sources and heuristics
that can be used to inform the development of tools for
creating workflow abstractions.
In this paper we present the result of an empirical analysis
performed over 177 workflow descriptions from Taverna [10]
and Wings [3]. Based on this analysis, we propose a catalogue
of scientific workflow motifs. Motifs are provided through i)
a characterization of the kinds of data-oriented activities that
are carried out within workflows, which we refer to as data-
oriented motifs, and ii) a characterization of the different man-
ners in which those activity motifs are realized/implemented
within workflows, which we refer to as workflow-oriented
motifs. It is worth mentioning that, although important, motifs
Fig. 3. Distribution of Data-Oriented Motifs per domain
Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data Preparation Motifs in the Genomics Workflows
We do this
repeatedly
for the
same
datasets!
“Throwaway Science”
Rinke Hoekstra (VU Amsterdam)
the problem:
“long tail of big data”
Big datasets…
• NAPP, Mosaic, IPUMS etc. solve this for large
datasets
• But this is very expensive
• And the results are not mutually compatible
• Or worse… the compatibility is contested
The problem we’re trying to solve…
• Many interesting datasets are messy, incomplete and incorrect
• Data analysis requires clean data
• Cleaning data involves careful interpretation and study
• Values and variables in the data are replaced with (more)
standard terms (coding)
• Cross-dataset analyses requires a further data harmonization
step
• This ‘data preparation’ step can take up to 60% of the total
work
Our aims
• simplify linking datasets
• broaden connectivity of datasets
• simplify replication of research
aye B see da be c di
A B C D
aye B see da be c di
A B C D
An example
Variable name:
• gender
• sex
• geslacht
Values:
• 1,0
• male, female
• m, v
Values:
• sdmx-code:sex-v
• sdmx-code:sex-m
Variable name:
• sdmx-dimension:sex
aye B see da be c di
A B C D
aye B see da be c di
A B C D
?
source: http://www.smallpc.net/wp-content/uploads/2012/10/worldmap-net2.jpg
A B C D
create - view - query - store & serve
Linked Data tools
creating Linked Data
QBer
CoW
viewing Linked Data
brwsr
inspector
querying Linked Data
grlc.io
SPARQL2git sparql2git.com
storing & serving Linked Data
DRUID
broadening Linked Data
vocabularies
HISCO LICR
broadening Linked Data
datasets Catasto
CEDAR
CLIO-INFRA
CMGPD
Gemeentegeschiedenis.nl
HSN (demo)
HISCO foreign languages
HISCAM
HISCLASS
Thombos
so what we can do now?
Get me the data to answer this question:
What’s the influence of someone’s religion on social
position, controlled for a country’s economic
position and time?
https://github.com/CLARIAH/wp4-queries/
questions?
@rlzijdeman #dhb2017

More Related Content

What's hot

Berlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleBerlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleCornelius Puschmann
 
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering TechniqueHandling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering TechniqueJAYAPRAKASH JPINFOTECH
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataRinke Hoekstra
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the WebRinke Hoekstra
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningPaul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
The MGI and AI
The MGI and AIThe MGI and AI
The MGI and AIaimsnist
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsPaul Groth
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Présentation PowerPoint - Diapositive 1
Présentation PowerPoint - Diapositive 1Présentation PowerPoint - Diapositive 1
Présentation PowerPoint - Diapositive 1butest
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Anubhav Jain
 
Big data analysis and modelling
Big data analysis and modellingBig data analysis and modelling
Big data analysis and modellingkeivan mahdavi
 
Project Name
Project NameProject Name
Project Namebutest
 
KunGao_Resume.
KunGao_Resume.KunGao_Resume.
KunGao_Resume.Gao Kun
 

What's hot (20)

Berlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleBerlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick Vandewalle
 
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering TechniqueHandling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique
Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Towards a Query-by-Example System for Knowledge Graphs
Towards a Query-by-Example System for Knowledge GraphsTowards a Query-by-Example System for Knowledge Graphs
Towards a Query-by-Example System for Knowledge Graphs
 
NoSQL (Not Only SQL)
NoSQL (Not Only SQL)NoSQL (Not Only SQL)
NoSQL (Not Only SQL)
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
The MGI and AI
The MGI and AIThe MGI and AI
The MGI and AI
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Présentation PowerPoint - Diapositive 1
Présentation PowerPoint - Diapositive 1Présentation PowerPoint - Diapositive 1
Présentation PowerPoint - Diapositive 1
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
Big data analysis and modelling
Big data analysis and modellingBig data analysis and modelling
Big data analysis and modelling
 
Project Name
Project NameProject Name
Project Name
 
KunGao_Resume.
KunGao_Resume.KunGao_Resume.
KunGao_Resume.
 

Similar to Data legend dh_benelux_2017.key

Wehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historiansWehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historiansBram van den Hout
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4CLARIAH
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Bertram Ludäscher
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceDavid De Roure
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchangelagoze
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so farElena Simperl
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep LearningAndre Freitas
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...BRNSSPublicationHubI
 

Similar to Data legend dh_benelux_2017.key (20)

Wehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historiansWehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historians
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018Aussois bda-mdd-2018
Aussois bda-mdd-2018
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems Science
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchange
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so far
 
Credible workshop
Credible workshopCredible workshop
Credible workshop
 
E43022023
E43022023E43022023
E43022023
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
 

More from Richard Zijdeman

grlc. store, share and run sparql queries
grlc. store, share and run sparql queriesgrlc. store, share and run sparql queries
grlc. store, share and run sparql queriesRichard Zijdeman
 
Rijpma's Catasto meets SPARQL dhb2017_workshop
Rijpma's Catasto meets SPARQL dhb2017_workshopRijpma's Catasto meets SPARQL dhb2017_workshop
Rijpma's Catasto meets SPARQL dhb2017_workshopRichard Zijdeman
 
Historical occupational classification and occupational stratification schemes
Historical occupational classification and occupational stratification schemesHistorical occupational classification and occupational stratification schemes
Historical occupational classification and occupational stratification schemesRichard Zijdeman
 
Labour force participation of married women, US 1860-2010
Labour force participation of married women, US 1860-2010Labour force participation of married women, US 1860-2010
Labour force participation of married women, US 1860-2010Richard Zijdeman
 
Advancing the comparability of occupational data through Linked Open Data
Advancing the comparability of occupational data through Linked Open DataAdvancing the comparability of occupational data through Linked Open Data
Advancing the comparability of occupational data through Linked Open DataRichard Zijdeman
 
work in a globalized world
work in a globalized worldwork in a globalized world
work in a globalized worldRichard Zijdeman
 
The Structured Data Hub in 2019
The Structured Data Hub in 2019The Structured Data Hub in 2019
The Structured Data Hub in 2019Richard Zijdeman
 
Examples of digital history at the IISH
Examples of digital history at the IISHExamples of digital history at the IISH
Examples of digital history at the IISHRichard Zijdeman
 
Introduction into R for historians (part 4: data manipulation)
Introduction into R for historians (part 4: data manipulation)Introduction into R for historians (part 4: data manipulation)
Introduction into R for historians (part 4: data manipulation)Richard Zijdeman
 
Introduction into R for historians (part 3: examine and import data)
Introduction into R for historians (part 3: examine and import data)Introduction into R for historians (part 3: examine and import data)
Introduction into R for historians (part 3: examine and import data)Richard Zijdeman
 
Introduction into R for historians (part 1: introduction)
Introduction into R for historians (part 1: introduction)Introduction into R for historians (part 1: introduction)
Introduction into R for historians (part 1: introduction)Richard Zijdeman
 
Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Richard Zijdeman
 
Using HISCO and HISCAM to code and analyze occupations
Using HISCO and HISCAM to code and analyze occupationsUsing HISCO and HISCAM to code and analyze occupations
Using HISCO and HISCAM to code and analyze occupationsRichard Zijdeman
 

More from Richard Zijdeman (16)

grlc. store, share and run sparql queries
grlc. store, share and run sparql queriesgrlc. store, share and run sparql queries
grlc. store, share and run sparql queries
 
Rijpma's Catasto meets SPARQL dhb2017_workshop
Rijpma's Catasto meets SPARQL dhb2017_workshopRijpma's Catasto meets SPARQL dhb2017_workshop
Rijpma's Catasto meets SPARQL dhb2017_workshop
 
Toogdag 2017
Toogdag 2017Toogdag 2017
Toogdag 2017
 
Historical occupational classification and occupational stratification schemes
Historical occupational classification and occupational stratification schemesHistorical occupational classification and occupational stratification schemes
Historical occupational classification and occupational stratification schemes
 
Basic introduction into R
Basic introduction into RBasic introduction into R
Basic introduction into R
 
Labour force participation of married women, US 1860-2010
Labour force participation of married women, US 1860-2010Labour force participation of married women, US 1860-2010
Labour force participation of married women, US 1860-2010
 
Advancing the comparability of occupational data through Linked Open Data
Advancing the comparability of occupational data through Linked Open DataAdvancing the comparability of occupational data through Linked Open Data
Advancing the comparability of occupational data through Linked Open Data
 
work in a globalized world
work in a globalized worldwork in a globalized world
work in a globalized world
 
The Structured Data Hub in 2019
The Structured Data Hub in 2019The Structured Data Hub in 2019
The Structured Data Hub in 2019
 
Examples of digital history at the IISH
Examples of digital history at the IISHExamples of digital history at the IISH
Examples of digital history at the IISH
 
Introduction into R for historians (part 4: data manipulation)
Introduction into R for historians (part 4: data manipulation)Introduction into R for historians (part 4: data manipulation)
Introduction into R for historians (part 4: data manipulation)
 
Introduction into R for historians (part 3: examine and import data)
Introduction into R for historians (part 3: examine and import data)Introduction into R for historians (part 3: examine and import data)
Introduction into R for historians (part 3: examine and import data)
 
Introduction into R for historians (part 1: introduction)
Introduction into R for historians (part 1: introduction)Introduction into R for historians (part 1: introduction)
Introduction into R for historians (part 1: introduction)
 
Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)
 
Using HISCO and HISCAM to code and analyze occupations
Using HISCO and HISCAM to code and analyze occupationsUsing HISCO and HISCAM to code and analyze occupations
Using HISCO and HISCAM to code and analyze occupations
 
Csdh sbg clariah_intr01
Csdh sbg clariah_intr01Csdh sbg clariah_intr01
Csdh sbg clariah_intr01
 

Recently uploaded

Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxSilpa
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Silpa
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY1301aanya
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Silpa
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLkantirani197
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxSilpa
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptxSilpa
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 

Recently uploaded (20)

Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 

Data legend dh_benelux_2017.key

  • 1. dataLegend A CLARIAH work package DH BeNeLux: Data Integration Challenges Utrecht University Utrecht, the Netherlands July 3, 2017 @rlzijdeman #dhb2017
  • 2. Data Preparation Common Motifs in Scientific Workflows: An Empirical Analysis Daniel Garijo⇤, Pinar Alper †, Khalid Belhajjame†, Oscar Corcho⇤, Yolanda Gil‡, Carole Goble† ⇤Ontology Engineering Group, Universidad Polit´ecnica de Madrid. {dgarijo, ocorcho}@fi.upm.es †School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk ‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu Abstract—While workflow technology has gained momentum in the last decade as a means for specifying and enacting compu- tational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc. I. INTRODUCTION Scientific workflows have been increasingly used in the last decade as an instrument for data intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i. e. the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow. A variety of workflow systems are in use [10] [3] [7] [2] serving several scientific disciplines. A workflow is a software [14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re- use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments. Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objec- tives: 1) To reverse-engineer the set of current practices in work- flow development through an analysis of empirical evi- dence. 2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use. 3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions. In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data- oriented motifs, and ii) a characterization of the different man- ners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 3. Distribution of Data-Oriented Motifs per domain Fig. 5. Data Preparation Motifs in the Genomics Workflows
  • 3. We do this repeatedly for the same datasets!
  • 5. the problem: “long tail of big data”
  • 6. Big datasets… • NAPP, Mosaic, IPUMS etc. solve this for large datasets • But this is very expensive • And the results are not mutually compatible • Or worse… the compatibility is contested
  • 7. The problem we’re trying to solve… • Many interesting datasets are messy, incomplete and incorrect • Data analysis requires clean data • Cleaning data involves careful interpretation and study • Values and variables in the data are replaced with (more) standard terms (coding) • Cross-dataset analyses requires a further data harmonization step • This ‘data preparation’ step can take up to 60% of the total work
  • 8. Our aims • simplify linking datasets • broaden connectivity of datasets • simplify replication of research
  • 9. aye B see da be c di A B C D
  • 10. aye B see da be c di A B C D
  • 11. An example Variable name: • gender • sex • geslacht Values: • 1,0 • male, female • m, v Values: • sdmx-code:sex-v • sdmx-code:sex-m Variable name: • sdmx-dimension:sex
  • 12. aye B see da be c di A B C D
  • 13. aye B see da be c di A B C D ?
  • 15. create - view - query - store & serve Linked Data tools
  • 19. storing & serving Linked Data DRUID
  • 21. broadening Linked Data datasets Catasto CEDAR CLIO-INFRA CMGPD Gemeentegeschiedenis.nl HSN (demo) HISCO foreign languages HISCAM HISCLASS Thombos
  • 22. so what we can do now? Get me the data to answer this question: What’s the influence of someone’s religion on social position, controlled for a country’s economic position and time?