Data analysis requires clean data, but cleanliness comes at a price: building a ready-to-use dataset involves careful interpretation of often messy, incomplete, and incorrect data, in which values and variables are replaced with standard terms (coding) and standard units of measure. Analyses that rely on multiple datasets require a further data-harmonization step. This work is time- and effort-intensive: studies show that in some domains this 'data preparation' step can take up to 60% of the total work. To make matters worse, every individual researcher repeats it, every time a new dataset is studied.
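To make the coding step concrete, the following is a minimal, hypothetical sketch (not part of QBer): free-text values are mapped onto a small standard code list, and wages recorded in mixed units are normalised to a single unit. The column names, codes, and conversion factors are invented for illustration.

```python
RAW_RECORDS = [
    {"occupation": "Carpenter", "wage": "12 guilders"},
    {"occupation": "carpentre", "wage": "0.5 pound"},
    {"occupation": "Smith", "wage": "3 guilders"},
]

# Hypothetical standard code list: spelling variants -> canonical code.
OCCUPATION_CODES = {
    "carpenter": "OCC-001",
    "carpentre": "OCC-001",  # historical spelling variant
    "smith": "OCC-002",
}

# Hypothetical conversion factors to a single unit of measure (guilders).
TO_GUILDERS = {"guilders": 1.0, "pound": 12.0}

def code_record(record):
    """Replace raw values with standard codes and a common unit."""
    occupation = OCCUPATION_CODES[record["occupation"].lower()]
    amount, unit = record["wage"].split()
    wage = float(amount) * TO_GUILDERS[unit]
    return {"occupation": occupation, "wage_guilders": wage}

coded = [code_record(r) for r in RAW_RECORDS]
```

After coding, the two carpenter records share one identifier and a comparable wage value, which is exactly what makes the dataset ready for analysis.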
To overcome this problem, important and large datasets are carefully curated and published in a standard, well-documented form. Unfortunately, three problems remain: 1) this is very expensive, 2) it is therefore only done for the larger datasets, and 3) the various efforts are not necessarily mutually compatible.
For these reasons, we are developing QBer: a tool that allows individual researchers to easily 1) code and harmonize their datasets according to the best practices of their community, 2) share new code lists with fellow researchers, 3) align code lists across datasets, and 4) publish their datasets in a standards-compliant format on a Structured Data Hub. By reusing identifiers (codes, standard terms) across datasets, we will grow a large volume of interconnected datasets that are directly ready for use in analyses.
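A small, hypothetical sketch of why reusing identifiers matters: two datasets that were coded against the same code list can be combined with a plain join, with no further harmonization step. The datasets and codes below are invented for illustration.

```python
# Counts per occupation code in one (hypothetical) dataset.
CENSUS_COUNTS = {"OCC-001": 120, "OCC-002": 75}

# Average wage per occupation code in another (hypothetical) dataset.
AVG_WAGES = {"OCC-001": 12.0, "OCC-002": 15.5}

# Because both datasets reuse the same identifiers, interconnecting
# them is a direct join on the shared codes.
combined = {
    code: {"count": CENSUS_COUNTS[code], "avg_wage": AVG_WAGES[code]}
    for code in CENSUS_COUNTS.keys() & AVG_WAGES.keys()
}
```

Without shared identifiers, each researcher would first have to re-align the two code lists by hand before any such join is possible.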