
The Structured Data Hub in 2019


This presentation provides an outlook on what we anticipate for the Structured Data Hub: creating linkable datasets, enhancing the use of provenance, adding quality flags to data, answering new questions and, finally, borrowing from and contributing to public sources such as DBpedia.

Published in: Data & Analytics

  1. 1. The Structured Data Hub: today’s fiction, 2019’s reality
  2. 2. Status quo: Many datasets currently live in isolation. They are stored on people’s computers and are not findable. Moreover, little effort is made to link such datasets. When data are linked, the datasets first have to be cleaned and harmonised, which is very time-intensive. More importantly, such linkage efforts are seldom shared, which results in ‘disposable research’.
  3. 3. What we envisage: to select core micro, meso and macro datasets from the field of economic and social history and to create a structured data hub from them.
  4. 4. What we envisage (diagram: Structured Data Hub, Your data, Tooling, WWW): Next, we want to allow you to connect your data and to build such connections yourself, while we ensure that your data are findable and linkable to other datasets on the (semantic) World Wide Web.
  5. 5. The Structured Data Hub: a place to store data, augment data, link data, find data and ask questions (for data analysis and visualization). So, the Structured Data Hub is a place to … Now let’s go into more detail on some of these aspects.
  6. 6. Data augmentation: A first feature of the Structured Data Hub is augmentation. By augmentation we mean the process of enhancing your data with core variables from the social, demographic and economic sciences.
  7. 7. For example, think of this dataset containing individual characteristics, including occupation and HISCO code. If we wanted to know whether these persons held high- or low-status occupations, we would need to add a stratification measure.
  8. 8. Here, we add the universal HISCAM scale, but any other HISCO-based stratification scale or class measure can be added.
  9. 9. We might also be interested in the area where people are working, here indicated by the place variable. If we wanted to map such values, or calculate distances between these places, we would need the latitude and longitude of each place.
  10. 10. Another type of data augmentation concerns the application of basic mathematical operations to derive new variables. Income, for example, is seldom analysed in its raw form and is often rescaled using a log transformation.
  11. 11. The Structured Data Hub facilitates the creation and documentation of such newly derived variables.
  12. 12. Provenance tracking: A second feature of the Data Hub is traceable provenance. Currently, bigger datasets such as Clio-Infra consist of a core part derived from a bigger statistical agency, combined with many smaller datasets as well as ‘corrections’ of the data by the researcher. After an iteration it is hard to track who contributed what, or which number was changed by whom and for what reason. We therefore introduce provenance tracking.
  13. 13. version 1 + activity = version 2. The basic formula for provenance we use is that one version leads to the next as the result of an activity.
  14. 14. Activity: who, when, what, how. For proper provenance it is crucial to describe this activity, at least in terms of what the activity entailed, how it was performed, by whom and in which time period.
  15. 15. Original version (surname, occupation): Fumes, cigar maker; Bridges, civil engineer; Moves, dancer; Bones, (no occupation). New version (new PID!): Fumes, cigar maker; Bridges, civil engineer; Moves, dancer; Bones, undertaker. PIDs shown: ab.123 and bc.789. Activity: added occupation Bones; from Gravediggers Vol II; 2015-12-09A09:30:17; dai:richard.zijdeman. In this example, the occupation for ‘Bones’ is added, which leads to a new version of the data and hence a new PID. Moreover, the action of adding the value for occupation is itself recorded as provenance.
  16. 16. Quality flags: An important aspect to consider when combining data is that datasets come in varying levels of quality.
  17. 17. Quality flags: Allow for quality flags on content, e.g. created by scientists; peer reviewed (by scientists); created by the public and peer reviewed. We will design a system in which datasets are accompanied by a ‘quality flag’, an indicator of the trustworthiness of the dataset. This might involve simple reputation effects, but could also provide more advanced features, such as whether other data confirm the values in this dataset. Work together with sestet on this.
  18. 18. Basic visualisation: Focus on visual exploration of data and results. ‘Ask’ a question and get visual output, e.g. a bar or line graph, or get output on a map or even as a ‘movie’. A final feature that we want to highlight here is the ability to ask questions and receive a ‘visual’ answer. Data visualisations are increasingly present in all sorts of media, and our hub will allow for such visualisations to answer basic questions on historical patterns.
  19. 19. To society and back
  20. 20. From Science to Society and back: Provide data to the public: ‘enthusiasts’, journalists. Have enthusiasts add data to the hub (creating linked data), e.g. stucadoors dataset, harbour datasets, railway datasets, etc. And back: link scientific data to crowd projects like DBpedia, e.g. enhance occupations with descriptions. The last point we want to make about the Structured Data Hub is that it is not just for academics: we provide our tools for a broader audience too. This means that we assume a fairly low level of historical knowledge and technical skill. However, we also believe that ‘the public’ is creating quite interesting datasets, from which we may borrow and to which we may give back by enriching them with scientific knowledge.
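
The provenance formula from slide 13 (version 1 + activity = version 2), applied to the derived-variable case from slides 10 and 11, could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the hub's actual implementation: the class names `Activity` and `Version`, the hash-based `make_pid` scheme, the contributor id `dai:example.researcher` and the income values are all hypothetical and do not come from the presentation.

```python
import hashlib
import json
import math
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Activity:
    """Provenance of one change: what was done, how, by whom and when."""
    what: str
    how: str
    who: str
    when: str  # ISO 8601 timestamp


@dataclass(frozen=True)
class Version:
    """One dataset version with its identifier and the activities behind it."""
    pid: str
    rows: tuple
    history: tuple = ()


def make_pid(rows):
    """Hypothetical PID scheme (an assumption): a short hash of the content."""
    payload = json.dumps(list(rows), sort_keys=True).encode("utf-8")
    return "pid:" + hashlib.sha256(payload).hexdigest()[:12]


def first_version(rows):
    """Create the initial version from a list of row dictionaries."""
    rows = tuple(dict(row) for row in rows)
    return Version(pid=make_pid(rows), rows=rows)


def apply_activity(version, transform, activity):
    """version 1 + activity = version 2: copy rows, transform, log the activity."""
    new_rows = tuple(transform(dict(row)) for row in version.rows)
    return Version(make_pid(new_rows), new_rows, version.history + (activity,))


def add_log_income(row):
    """Derived variable: natural log of income."""
    row["log_income"] = math.log(row["income"])
    return row


# Surnames and occupations follow slide 15; the income values are made up.
v1 = first_version([
    {"surname": "Fumes", "occupation": "cigar maker", "income": 100.0},
    {"surname": "Bridges", "occupation": "civil engineer", "income": 400.0},
])

activity = Activity(
    what="added variable log_income",
    how="natural logarithm of income",
    who="dai:example.researcher",  # hypothetical contributor id
    when=datetime.now(timezone.utc).isoformat(),
)

v2 = apply_activity(v1, add_log_income, activity)

print(v1.pid, "->", v2.pid)
for step in v2.history:
    print(step.who, step.when, step.what, step.how)
```

Because each version keeps its own rows, the earlier version stays retrievable under its old PID, and the history answers the question raised on slide 12: who changed what, when and how.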