QBer 

Crowd Based Coding and Harmonization using Linked Data
Rinke Hoekstra and Albert Meroño-Peñuela
The problem we’re trying to solve…
• Many interesting datasets are messy, incomplete and incorrect

• Data analysis requires clean data
• Cleaning data involves careful interpretation and study

• Values and variables in the data are replaced with (more) standard terms
(coding)

• Cross-dataset analyses require a further data harmonization step

• This ‘data preparation’ step can take up to 60% of the total work
Data Preparation
Common Motifs in Scientific Workflows:
An Empirical Analysis
Daniel Garijo*, Pinar Alper†, Khalid Belhajjame†, Oscar Corcho*, Yolanda Gil‡, Carole Goble†
*Ontology Engineering Group, Universidad Politécnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
†School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk
‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu
Abstract—While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.
I. INTRODUCTION
Scientific workflows have been increasingly used in the last decade as an instrument for data intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i.e. the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow.
A variety of workflow systems are in use [10] [3] [7] [2] serving several scientific disciplines. A workflow is a software
[14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re-use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments.
Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objectives:
1) To reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence.
2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use.
3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions.
In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data-oriented motifs, and ii) a characterization of the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs
Fig. 3. Distribution of Data-Oriented Motifs per domain
Fig. 5. Data Preparation Motifs in the Genomics Workflows
We do this repeatedly for the same datasets!
Big datasets…
• NAPP, Mosaic, IPUMS etc. solve this for large datasets

• But this is very expensive

• And the results are not mutually compatible

• Or worse… the compatibility is contested
What QBer does…
• Empower individual researchers to
• Code and harmonize individual datasets according to best practices of the
community (e.g. HISCO, SDMX, World Bank, etc.) or against their colleagues

• Share their own code lists with fellow researchers

• Align code lists across datasets

• Publish their standards-compliant datasets on a Structured Data Hub
We use web-based linked data to grow a
giant graph of interconnected datasets
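A minimal sketch of how aligned code lists could connect datasets in such a graph. It assumes mappings are stored as simple (subject, predicate, object) triples; the dataset names, the `mapsTo` predicate and the codes are hypothetical placeholders, not QBer's actual vocabulary or real HISCO codes.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); all names are illustrative.
triples = [
    ("datasetA:occupation/farmer", "mapsTo", "standard:CODE-001"),
    ("datasetB:beroep/boer", "mapsTo", "standard:CODE-001"),
    ("datasetB:beroep/smid", "mapsTo", "standard:CODE-002"),
]


def aligned_values(graph):
    """Group local values by the shared standard code they map to."""
    by_code = defaultdict(list)
    for subject, predicate, obj in graph:
        if predicate == "mapsTo":
            by_code[obj].append(subject)
    # Keep only codes that link values from more than one dataset
    # (the dataset is taken to be the prefix before the first colon).
    return {
        code: vals
        for code, vals in by_code.items()
        if len({v.split(":", 1)[0] for v in vals}) > 1
    }


links = aligned_values(triples)
print(links)
```

In this toy graph, two datasets become comparable as soon as one of their local values points at the same standard code, which is the sense in which shared mappings grow the graph.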
QBer’s Architecture
Exists
Frequency Table
Variable does not yet exist
Variable
Mappings
Publish
Harmonize
Includes both external Linked Data and
standard vocabularies, e.g. World Bank
Structured Data Hub
External Data
Existing
Variables
Provenance tracking of all data
Legacy Systems
Browse
Screencast
https://vimeo.com/130322985
What you just saw
• Uploading of a microdata dataset and extraction of variables and value frequencies
• Gleaning of known variables and code lists from the Web

• Mapping of variable values to codes (while preserving the originals!)
• Publishing of dataset structure as Linked Data

• Provenance of all assertions to the SDH traceable to time and person

• Collaborative growing of a graph of interconnected datasets
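The first steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not QBer's implementation: the records, the code list (placeholder codes, not real HISCO codes), the case-insensitive matching rule and the field names are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical microdata: one dict per record (not taken from the slides).
records = [
    {"occupation": "farmer"},
    {"occupation": "Farmer"},
    {"occupation": "blacksmith"},
    {"occupation": "farmer"},
]

# Hypothetical code list; the codes are placeholders, not real HISCO codes.
code_list = {"farmer": "CODE-001", "blacksmith": "CODE-002"}


def frequency_table(rows, variable):
    """Count how often each raw value of `variable` occurs."""
    return Counter(row[variable] for row in rows)


def map_values(freqs, codes, author):
    """Pair each raw value with a code while keeping the original value.

    Matching is case-insensitive (an assumption made for this sketch).
    Unmatched values get code None so a person can map them later.
    """
    now = datetime.now(timezone.utc).isoformat()
    return [
        {
            "original": value,  # the raw value is preserved, never overwritten
            "code": codes.get(value.strip().lower()),
            "count": count,
            "mapped_by": author,  # provenance: who made the assertion
            "mapped_at": now,  # provenance: when it was made
        }
        for value, count in freqs.items()
    ]


freqs = frequency_table(records, "occupation")
mappings = map_values(freqs, code_list, author="researcher@example.org")
for m in mappings:
    print(m["original"], "->", m["code"], f"({m['count']}x)")
```

Storing the mapping next to the original value, rather than replacing it, keeps the coding decision reversible and attributable.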
Future benefits
• Automatic extraction of interesting data across datasets

• Opportunities for large scale cross-dataset studies

• Crowd-based production of code lists and mappings

• Reuse other people’s work (or stand on the shoulders of giants)

• No disposable research
