QBer - Crowd Based Coding and Harmonization using Linked Data

Screencast: https://vimeo.com/130322985

Data analysis requires clean data, but cleanliness comes at a price: building a ready-to-use dataset involves careful interpretation of often messy, incomplete and incorrect data, in which values and variables are replaced with standard terms and units of measure (coding). For analyses that rely on multiple datasets, a further data-harmonization step is needed. This work is time-consuming and labour-intensive; studies show that in some domains this 'data preparation' step can take up to 60% of the total effort. To make matters worse, every individual researcher repeats it every time a new dataset is studied.

To overcome this problem, important and large datasets are carefully curated and published in a standard, well-documented form. Unfortunately, three problems remain: 1) this is very expensive, 2) it is therefore only done for the larger datasets, and 3) the various efforts are not necessarily mutually compatible.

For these reasons, we are developing QBer, a tool that will allow individual researchers to easily 1) code and harmonize their datasets according to best practices of the community, 2) share new code lists with fellow researchers, 3) align code lists across datasets, and 4) publish their datasets in a standards-compliant format on a Structured Data Hub. By reusing identifiers (codes, standard terms) across datasets, we will grow a large volume of interconnected datasets that are directly ready for use in analyses.
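To make the 'coding' step concrete, here is a minimal sketch, assuming Python with rdflib; the example.org namespaces and the HISCO code below are illustrative placeholders, not QBer's actual vocabularies or output format.

```python
# Minimal sketch (not QBer's implementation): coding a raw occupation value
# against a shared code list as Linked Data, while keeping the original value.
# All URIs and the HISCO code below are illustrative placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/dataset/1870-census/")  # hypothetical dataset namespace
HISCO = Namespace("http://example.org/hisco/")             # placeholder for a HISCO code list

g = Graph()
g.bind("skos", SKOS)

# The raw value found in the source file, modelled as its own resource so the
# original spelling is never thrown away.
raw = EX["value/occupation/boereknecht"]
g.add((raw, RDF.type, SKOS.Concept))
g.add((raw, SKOS.prefLabel, Literal("boereknecht", lang="nl")))

# The coding decision: the raw value maps to a standard (illustrative) code.
g.add((raw, SKOS.exactMatch, HISCO["62120"]))

print(g.serialize(format="turtle"))
```

Because the mapping is a separate SKOS link rather than an in-place replacement, the standard code and the original spelling remain available side by side, and other datasets that reuse the same code become directly comparable.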


1. QBer - Crowd Based Coding and Harmonization using Linked Data. Rinke Hoekstra and Albert Meroño-Peñuela
2. The problem we're trying to solve…
• Many interesting datasets are messy, incomplete and incorrect
• Data analysis requires clean data
• Cleaning data involves careful interpretation and study
• Values and variables in the data are replaced with (more) standard terms (coding)
• Cross-dataset analyses require a further data harmonization step
• This 'data preparation' step can take up to 60% of the total work
3. Data Preparation
[Slide shows the first page of "Common Motifs in Scientific Workflows: An Empirical Analysis" by Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil and Carole Goble, including figures on the distribution of data-oriented motifs per domain and on data preparation motifs in the genomics workflows.]
4. We do this repeatedly for the same datasets!
5. Big datasets…
• NAPP, Mosaic, IPUMS etc. solve this for large datasets
• But this is very expensive
• And the results are not mutually compatible
• Or worse… the compatibility is contested
6. What QBer does…
• Empower individual researchers to:
  • Code and harmonize individual datasets according to best practices of the community (e.g. HISCO, SDMX, World Bank) or against their colleagues
  • Share their own code lists with fellow researchers
  • Align code lists across datasets
  • Publish their standards-compliant datasets on a Structured Data Hub (a sketch of such output follows below)
• We use web-based Linked Data to grow a giant graph of interconnected datasets
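As a rough illustration of what a standards-compliant dataset on the Structured Data Hub could contain, the sketch below builds a single observation with the W3C RDF Data Cube (qb:) vocabulary in rdflib; all example.org URIs, dimension names and values are hypothetical and only indicate the general shape of such output, not QBer's actual data model.

```python
# Minimal sketch: one RDF Data Cube observation whose dimensions reuse shared
# identifiers (a place URI and a HISCO occupation code). Illustrative only.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
SDMX_DIM = Namespace("http://purl.org/linked-data/sdmx/2009/dimension#")
EX = Namespace("http://example.org/sdh/")  # hypothetical Structured Data Hub namespace

g = Graph()
g.bind("qb", QB)

obs = EX["observation/42"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/1870-census"]))
# Dimensions point to shared identifiers, so observations from different
# datasets can be joined directly.
g.add((obs, SDMX_DIM.refArea, URIRef("http://example.org/place/Utrecht")))
g.add((obs, EX.occupation, URIRef("http://example.org/hisco/62120")))
g.add((obs, EX.frequency, Literal(17, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```

Because the occupation dimension reuses the same code identifier produced during coding, a single query can aggregate this observation with observations from other datasets.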
7. QBer's Architecture
[Architecture diagram with the labels: Frequency Table; Variable Mappings (variable exists / variable does not yet exist); Harmonize; Publish; Structured Data Hub (includes both external Linked Data and standard vocabularies, e.g. World Bank); External Data; Existing Variables; Legacy Systems; Browse; provenance tracking of all data.]
8. Screencast: https://vimeo.com/130322985
9. What you just saw
• Uploading of a microdata dataset and extraction of variables and value frequencies
• Gleaning of known variables and code lists from the Web
• Mapping of variable values to codes (while preserving the originals!)
• Publishing of the dataset structure as Linked Data
• Provenance of all assertions to the SDH, traceable to time and person (see the sketch after this list)
• Collaborative growing of a graph of interconnected datasets
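The provenance point above could, for instance, be expressed with the W3C PROV-O vocabulary. The sketch below (illustrative URIs and names, not QBer's actual provenance model) attaches a generating activity, a timestamp and a responsible person to a single mapping assertion.

```python
# Minimal sketch: PROV-O provenance making one mapping assertion traceable to
# time and person. All URIs and names are illustrative placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/sdh/")  # hypothetical Structured Data Hub namespace

g = Graph()
g.bind("prov", PROV)

mapping = EX["mapping/occupation-boereknecht"]   # the assertion being tracked
activity = EX["activity/qber-session-1"]         # the QBer session that produced it
person = EX["person/example-researcher"]         # the researcher responsible

g.add((mapping, RDF.type, PROV.Entity))
g.add((mapping, PROV.wasGeneratedBy, activity))
g.add((mapping, PROV.wasAttributedTo, person))

g.add((activity, RDF.type, PROV.Activity))
g.add((activity, PROV.endedAtTime, Literal("2015-06-29T14:30:00", datatype=XSD.dateTime)))

g.add((person, RDF.type, PROV.Person))
g.add((person, FOAF.name, Literal("A. Researcher")))

print(g.serialize(format="turtle"))
```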
10. Future benefits
• Automatic extraction of interesting data across datasets
• Opportunities for large-scale cross-dataset studies
• Crowd-based production of code lists and mappings
• Reuse other people's work (or stand on the shoulders of giants)
• No disposable research
