
Linked Census Data


Talk about the use of Linked Data in historical research on census data. Includes some slides about TabLinker (http://github.com/Data2Semantics/TabLinker). Part of the Data2Semantics project (http://data2semantics.org).

Published in: Technology


  1. Linked Census Data. Rinke Hoekstra, CEDAR Kickoff, Thursday 26 January 2012.
  2. Overview
      - “Can Linked Data make a difference for historical analysis?”
      - Problem
      - Procedure (as I understand it)
      - Step-by-step
      - Vocabularies, tools
      - Conclusion
  3. Problem
      - ~519 Excel spreadsheets (more?... I heard 1200)
      - Want to do analysis over time and space, but...
      - Structure: Excel sheets cannot be readily imported into a database
      - Contents: Excel sheets are neither normalised (age) nor harmonised (occupations/places), and they contain errors (both original and data-entry)
      - Want to preserve all stages of data cleansing/harmonisation
  4. Procedure
      - Stages: Archiving, Correcting/Documenting, Interpreting, Normalising, Harmonising, Visualising
      - Verbatim import of sheets to database/triple store
      - Add missing information (headers)
      - Add corrected information (data)
      - Interpret and correct objective information
      - Link information across sheets
      - Link information to other datasets (e.g. locations)
      - Build (generic) visualisations of results
  5. ... a bit about Linked Data
      - “Just another Data Model”
      - RDF ≠ Ontology (OWL)
      - RDF ≠ Taxonomy (RDFS/SKOS)
      - Globally Unique Identifiers (URI) for all entities
      - Dereferenceable on the Web (URI = URL)
      - HTTP-accessible databases (triple stores, SPARQL)
      - Triples all the way: <subject, predicate, object>
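
The triple model on slide 5 can be sketched in a few lines of plain Python. This is only an illustration of the data model, not a triple store: the `http://example.com/vocab/` terms and the `match` helper are hypothetical names introduced here, and `None` stands in for a SPARQL-style variable.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# ASSUMPTION: illustrative URIs only; the vocab terms below are made up.
EX = "http://example.com/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples: List[Triple] = [
    (EX + "workbook1", RDF_TYPE, EX + "vocab/Workbook"),
    (EX + "workbook1", EX + "vocab/hasSheet", EX + "workbook1/sheet1"),
    (EX + "workbook1/sheet1", RDF_TYPE, EX + "vocab/Sheet"),
]


def match(graph: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None matches anything."""
    for subj, pred, obj in graph:
        if (s is None or subj == s) and (p is None or pred == p) and (o is None or obj == o):
            yield (subj, pred, obj)


# "Which sheets does any workbook have?"
sheets = [obj for _, _, obj in match(triples, p=EX + "vocab/hasSheet")]
```

Because every entity is a URI, the same pattern works regardless of which sheet or dataset a triple originally came from.
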
  6. Spreadsheet ≠ Database. In a database:
      - Primary keys are entities
      - Column names are attributes
      - Cell values are attribute values
      - Secondary keys are relations to other entities
  9. Spreadsheet ≠ Database. In a spreadsheet:
      - No primary keys!
      - Anything can be an entity
      - Column headers are “types”
      - Row headers are “types”
      - Hierarchies!
      - Cell values are entity “values”
      - No relations to other entities
  10. Anatomy of a Spreadsheet: a Workbook contains Sheets, and each Sheet contains Cells.
  11. Anatomy of a Spreadsheet: Workbook1.xls
      - Sheet1: Sheet1:A1, Sheet1:B1, Sheet1:C1; Sheet1:A2, Sheet1:B2, Sheet1:C2; ...
      - Sheet2: Sheet2:A1, Sheet2:B1, Sheet2:C1; Sheet2:A2, Sheet2:B2, Sheet2:C2; ...
  13. Anatomy of a Spreadsheet: Workbook1.xls
      - Sheet1: header "workers"; rows: agriculture 12, industry 6, ...
      - Sheet2: header "diamond cutters"; rows: A 34, B 67, ...
      - NB: all URIs scoped to sheet!
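
The note that all URIs are scoped to their sheet can be made concrete with a small sketch. The base URL and the path layout (`workbook/sheet/name`) are assumptions for illustration; the slides do not specify how the URIs are actually minted.

```python
from urllib.parse import quote

# ASSUMPTION: hypothetical base URL and path layout.
BASE = "http://example.com/"


def cell_uri(workbook: str, sheet: str, cell: str) -> str:
    """URI for a cell address such as A1, unique within workbook and sheet."""
    return f"{BASE}{quote(workbook)}/{quote(sheet)}/{quote(cell)}"


def label_uri(workbook: str, sheet: str, label: str) -> str:
    """URI for a header label; the same label on another sheet gets another URI."""
    slug = label.strip().replace(" ", "_")
    return f"{BASE}{quote(workbook)}/{quote(sheet)}/{quote(slug)}"


a1 = cell_uri("Workbook1.xls", "Sheet1", "A1")
workers_1 = label_uri("Workbook1.xls", "Sheet1", "workers")
workers_2 = label_uri("Workbook1.xls", "Sheet2", "workers")
```

Scoping means that `workers_1` and `workers_2` are different entities until someone explicitly links them, which is the job of the later harmonisation step.
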
  14. Data Cube: how to best represent numeric data, in a flexible way?
      - SDMX (Eurostat, World Bank, CBS, etc.)
      - Every data item is an observation
      - Every observation has a value
      - Every observation has one or more dimensions
  16. Data Cube: same points, illustrated with one census cell
      - Values shown: 12, 1878, M, O, I, E, pannenbakkers (roof-tile makers), D, 1
      - Headers shown: leeftijd (age), geboortejaar (birth year), geslacht (sex), huwelijkse staat (marital status), nummer der beroepsklasse (occupational class number), letter der beroepsklasse (occupational class letter), beroep (occupation), positie (position)
  17. Data Cube: the same example, now with four question marks placed among the headers leeftijd, nummer der beroepsklasse, geboortejaar, geslacht and huwelijkse staat
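
A minimal sketch of the observation model from the Data Cube slides: one value plus a set of named dimensions. The pairing of headers and values below is a guess from the flattened slide, and the second observation is invented purely so the aggregation has something to add up.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Observation:
    """One data item: a measured value plus the dimensions that locate it."""
    value: int
    dimensions: Dict[str, str] = field(default_factory=dict)


def total(observations: Iterable[Observation], fixed: Dict[str, str]) -> int:
    """Sum the values of all observations whose dimensions match `fixed`."""
    return sum(
        obs.value
        for obs in observations
        if all(obs.dimensions.get(name) == wanted for name, wanted in fixed.items())
    )


# ASSUMPTION: which value belongs to which header is inferred, not stated.
tile_makers = Observation(value=1, dimensions={
    "leeftijd": "12",
    "geboortejaar": "1878",
    "geslacht": "M",
    "huwelijkse staat": "O",
    "nummer der beroepsklasse": "I",
    "letter der beroepsklasse": "E",
    "beroep": "pannenbakkers",
    "positie": "D",
})
# Hypothetical second observation, not taken from the census.
invented = Observation(value=3, dimensions={"geslacht": "V", "beroep": "pannenbakkers"})

cube = [tile_makers, invented]
per_occupation = total(cube, {"beroep": "pannenbakkers"})
```

Fixing more dimensions narrows the slice; fixing none sums the whole cube.
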
  20. Anatomy of a Spreadsheet: Properties, Headers, RowHeaders, Data (http://github.com/Data2Semantics/TabLinker)
  21. [Graph diagram] One observation as RDF. Recoverable labels:
      - Blank node _:x with d2s:populationSize "1"^^xsd:int and several d2s:dimension edges
      - Value nodes: :I, :I/E, :M, :O, :D, :14--15_1875--1874, Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers
      - Header nodes: :Nummer_der_beroepsklasse, :Letter__Onderdeel_beroepsklasse_, :BENAMING_van_de_onderdeelen_der_onderscheidene_beroepsklassen__met_de_daartoe_behoorende_beroepen, :Positie_in_het_beroep__aangeduid_met_A__B__C_of_D
      - skos:broader edges between levels of the hierarchy
      - Cell Sheet1:D15
  22. [Graph diagram] The same graph, extended with the source cells and their types. Additional recoverable labels:
      - Cells: Sheet1:B8, Sheet1:C14, Sheet1:D15, Sheet1:E15, Sheet1:F15, Sheet1:L3, Sheet1:L4, Sheet1:L5, Sheet1:L6, Sheet1:L15
      - Cell types (rdf:type): d2s:Header, d2s:RowHeader, d2s:HierarchicalRowHeader, d2s:DataCell, d2s:Metadata
      - Links from cells to graph nodes: d2s:isDimension, d2s:isObservation
      - Additional nodes: :Regelnummer, :5, :10
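
To make the cell-to-observation idea in these diagrams more tangible, here is a rough sketch that walks a small grid and links each data cell to its row and column headers. It is not TabLinker's actual algorithm or API: the function names, the fixed header row and column counts, and the output shape are all assumptions. The grid reuses the "workers" example from the earlier spreadsheet slides.

```python
from typing import Dict, List, Optional, Union

Cell = Optional[Union[str, int]]


def column_letter(index: int) -> str:
    """Turn a zero-based column index into a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def sheet_to_observations(sheet: str, grid: List[List[Cell]],
                          header_rows: int = 1,
                          row_header_cols: int = 1) -> List[Dict]:
    """Emit one observation per non-empty data cell, with its headers as dimensions."""
    observations = []
    for r in range(header_rows, len(grid)):
        for c in range(row_header_cols, len(grid[r])):
            value = grid[r][c]
            if value is None:
                continue
            row_headers = [grid[r][k] for k in range(row_header_cols) if grid[r][k] is not None]
            col_headers = [grid[h][c] for h in range(header_rows) if grid[h][c] is not None]
            observations.append({
                "cell": f"{sheet}:{column_letter(c)}{r + 1}",
                "value": value,
                # Dimension names are scoped to the sheet, as on the slides.
                "dimensions": [f"{sheet}:{h}" for h in row_headers + col_headers],
            })
    return observations


grid = [
    [None, "workers"],
    ["agriculture", 12],
    ["industry", 6],
]
observations = sheet_to_observations("Sheet1", grid)
```

Real census sheets have hierarchical row headers and metadata cells, which this sketch deliberately ignores.
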
  23. What TabLinker can’t do
      - Annotations (“footnote”-style on separate sheet)
      - Interpret functions (e.g. automatic sums)
      - Integrate/harmonise across sheets/files
      - Additional useful functionality: “checksum” functionality, export to database tables
  24. Normalising & Correcting [graph diagram]: observation _:x with d2s:populationSize "1"^^xsd:int and d2s:dimension :14--15_1875--1874
  25. Normalising & Correcting [graph diagram]: the original observation shown next to a normalised and corrected version
      - Original: _:x with d2s:populationSize "1"^^xsd:int and d2s:dimension :14--15_1875--1874
      - Normalised and corrected: d2s:populationSize "11"^^xsd:int, d2s:censusYear "1889"^^xsd:int, d2s:birthYears :1875--1874, d2s:ageGroup :14-15, d2s:gemeente (municipality) :Assendelft
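
The normalisation step on slide 25 splits the combined label `14--15_1875--1874` into an age group and a birth-year range. A minimal sketch of that split follows; the regular expression, the function name and the `consistent` check (census year minus age equals birth year, ignoring where the birthday falls) are assumptions added for illustration.

```python
import re
from typing import Dict, Union

# Label format as it appears on the slide: "<age>--<age>_<year>--<year>".
LABEL = re.compile(r"^(\d+)--(\d+)_(\d{4})--(\d{4})$")


def normalise(label: str, census_year: int) -> Dict[str, Union[str, int, bool]]:
    """Split a combined age/birth-year header into separate dimensions."""
    found = LABEL.match(label)
    if found is None:
        raise ValueError(f"unrecognised header label: {label!r}")
    age_a, age_b, born_a, born_b = (int(part) for part in found.groups())
    return {
        "ageGroup": f"{age_a}-{age_b}",
        "birthYears": f"{born_a}--{born_b}",
        "censusYear": census_year,
        # Rough plausibility check, not part of the slides.
        "consistent": census_year - age_a == born_a and census_year - age_b == born_b,
    }


result = normalise("14--15_1875--1874", 1889)
```

For the 1889 census the check holds (1889 - 14 = 1875 and 1889 - 15 = 1874), so a failing check would be a useful hint that a header or year was mistyped.
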
  26. Documenting [graph diagram]
      - Two graphs: <http://example.com/workbook1/sheet1> and <http://example.com/workbook1/sheet1/corrected>
      - Original value: d2s:populationSize "1"^^xsd:int with d2s:dimension :14--15_1875--1874
      - Corrected value: d2s:populationSize "11"^^xsd:int, d2s:censusYear "1889"^^xsd:int, d2s:birthYears :1875--1874, d2s:ageGroup :14-15, d2s:gemeente :Assendelft
      - Provenance: :curation20120126 rdf:type provo:Activity, connected through provo:wasGeneratedBy, provo:hadAgent (:RinkeHoekstra), provo:startedAt and provo:endedAt; the time instants _:a and _:b carry time:inXSDDateTime "20120126T08:30:00" and "20120126T09:00:00"
      - See http://www.w3.org/TR/prov-o/
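
The provenance trail on slide 26 can be sketched as a handful of triples. The property names are copied from the slide, which predates the final PROV-O recommendation, so they may not match current PROV-O terms. The edge directions, the function name and the assumption that 08:30 is the start and 09:00 the end are guesses from the flattened diagram.

```python
from datetime import datetime
from typing import List, Tuple

Triple = Tuple[str, str, str]
STAMP = "%Y%m%dT%H:%M:%S"  # timestamp layout as shown on the slide


def curation_provenance(corrected_graph: str, activity: str, agent: str,
                        started: datetime, ended: datetime) -> List[Triple]:
    """Describe which activity produced a corrected graph, by whom, and when."""
    if ended < started:
        raise ValueError("activity cannot end before it starts")
    return [
        (activity, "rdf:type", "provo:Activity"),
        # ASSUMPTION: direction of these edges is inferred from the diagram.
        (corrected_graph, "provo:wasGeneratedBy", activity),
        (activity, "provo:hadAgent", agent),
        (activity, "provo:startedAt", "_:a"),
        ("_:a", "time:inXSDDateTime", started.strftime(STAMP)),
        (activity, "provo:endedAt", "_:b"),
        ("_:b", "time:inXSDDateTime", ended.strftime(STAMP)),
    ]


trail = curation_provenance(
    "<http://example.com/workbook1/sheet1/corrected>",
    ":curation20120126",
    ":RinkeHoekstra",
    datetime(2012, 1, 26, 8, 30),
    datetime(2012, 1, 26, 9, 0),
)
```

Keeping the corrected data in its own graph means the original sheet stays untouched and every correction can be traced back to an activity and a person.
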
  27. Harmonising [tree diagram]: occupational class I with sub-classes D, E and A, linked by skos:broader; below them, also linked by skos:broader, four occupation groups:
      - Fabricage van kalk (lime production)
      - Fabricage van steen (molensteen, steenbakkers, tegelbakkers) (stone: millstones, brick makers, tile makers)
      - Fabricage van dakpannen (pannenbakkers) (roof tiles: roof-tile makers)
      - Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.) (earthenware, incl. porcelain, terracotta, stove makers, potters, etc.)
  28. Harmonising [tree diagram]: the same tree, with its nodes mapped to HISCO codes
      - Mapping relations shown: skos:exactMatch (4x), skos:broadMatch (2x), skos:closeMatch (1x)
      - Codes shown: HISCO:23810, HISCO:23811, HISCO:25281 (3x), HISCO:26340, HISCO:26345
  29. Harmonising [tree diagram]: the same tree shown twice, once with plain labels and once with every node as a sheet-scoped URI (Sheet1:I, Sheet1:D, Sheet1:E, Sheet1:A, Sheet1:Fabricage van kalk, ...)
  30. [Tree diagram] The 1889 tree next to the 1899 tree
      - In 1899 the labels read "Fabricage van steen (steenbakkers, tegelbakkers)" and "Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.)"; the other two labels are unchanged
      - Mapping relations between the two years: skos:narrowMatch (2x), skos:closeMatch, skos:exactMatch
  31. [Tree diagram] The same 1889/1899 comparison, with two additions
      - Is SKOS sufficient?
      - NB: These are not strings, but globally unique URIs, scoped within their spreadsheet (graph!) of origin.
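
A small sketch of how the SKOS links in the harmonisation slides could be used: walking `skos:broader` upwards, and looking up counterparts in another census year. The `Sheet1:` names follow the earlier graph slide; the `census1889:` and `census1899:` names, and which mapping relation sits on which pair, are hypothetical, since the flattened diagram does not show them.

```python
from typing import Dict, List, Sequence, Tuple

# Child -> parents, following skos:broader.
broader: Dict[str, List[str]] = {
    "Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers": ["Sheet1:I/E"],
    "Sheet1:I/E": ["Sheet1:I"],
}


def ancestors(concept: str, links: Dict[str, List[str]]) -> List[str]:
    """All concepts reachable by following skos:broader, nearest first."""
    found: List[str] = []
    todo = [concept]
    while todo:
        current = todo.pop()
        for parent in links.get(current, []):
            if parent not in found:
                found.append(parent)
                todo.append(parent)
    return found


# ASSUMPTION: illustrative mappings between two census years.
mappings: List[Tuple[str, str, str]] = [
    ("census1889:Fabricage_van_dakpannen", "skos:exactMatch", "census1899:Fabricage_van_dakpannen"),
    ("census1889:Fabricage_van_steen", "skos:narrowMatch", "census1899:Fabricage_van_steen"),
]


def counterparts(concept: str, links: Sequence[Tuple[str, str, str]],
                 accepted: Sequence[str] = ("skos:exactMatch", "skos:closeMatch")) -> List[str]:
    """Targets that may be compared directly with `concept`."""
    return [target for source, relation, target in links
            if source == concept and relation in accepted]


chain = ancestors("Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers", broader)
same_1899 = counterparts("census1889:Fabricage_van_dakpannen", mappings)
```

Which relations count as "comparable" is a research decision; here a narrow match is excluded by default, so the stone category would not be compared across years without widening `accepted`.
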
  34. Vocabularies, Tools
      - Vocabularies: Data Cube, SKOS, W3C Time, PROV-O
      - Excel + TabLinker: semi-automatic conversion of Excel sheets to RDF
      - ProvTracer: create PROV-O provenance trail for shell/python scripts
      - Visualization: prototype SGVizler (SPARQL + Google Graph API)
  35. Discussion: advantages of the Linked Data approach
      - Straightforward transformation from spreadsheets
      - Seamless integration of original, corrected and harmonised data
      - Ingestion of external (linked) data
      - Powerful documentation (provenance)
      - Everything is transparently query-able (SPARQL)
      - .... on the Web
  36. Discussion: disadvantages of the Linked Data approach (subject to research)
      - Size? (300k * 519 sheets = 156M triples)
      - Only rudimentary support for arithmetical operations in queries
      - No dynamic/conditional ‘view’-like graphs
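
The size estimate on slide 36 is simple arithmetic and easy to re-run for other sheet counts. The sketch below reads "300k" as roughly 300,000 triples per sheet, which is an interpretation of the slide; the 1200 figure is the alternative sheet count mentioned on the problem slide.

```python
# ASSUMPTION: "300k" on the slide means about 300,000 triples per sheet.
TRIPLES_PER_SHEET = 300_000


def estimated_triples(sheets: int, per_sheet: int = TRIPLES_PER_SHEET) -> int:
    """Back-of-the-envelope size of the triple store."""
    return sheets * per_sheet


current = estimated_triples(519)    # 155,700,000, i.e. about 156M
rumoured = estimated_triples(1200)  # 360,000,000 if there are 1200 sheets
```
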
  37. SPARQL vs. SQL? Middle ground? Expose database through D2RQ
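
The "middle ground" idea is to keep the data in a relational database and present it as triples. The sketch below shows the concept only, using the database-side mapping from the earlier "Spreadsheet ≠ Database" slide (primary key as entity, column name as attribute, cell value as attribute value). It is not D2RQ's mapping language or API; the table, the column names and the URI layout are hypothetical.

```python
import sqlite3
from typing import Iterator, Tuple, Union

Value = Union[str, int, float]


def rows_as_triples(conn: sqlite3.Connection, table: str, key_column: str,
                    base: str) -> Iterator[Tuple[str, str, Value]]:
    """Present each row as triples: key -> subject, column -> predicate, cell -> object."""
    # The table name is interpolated, so it must come from trusted code, not user input.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{base}{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                yield (subject, f"{base}vocab/{table}_{column}", value)


# Hypothetical table holding one corrected observation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (id INTEGER PRIMARY KEY, gemeente TEXT, population_size INTEGER)")
conn.execute("INSERT INTO observation VALUES (1, 'Assendelft', 11)")

exposed = list(rows_as_triples(conn, "observation", "id", "http://example.com/"))
```

With such a view, arithmetic and aggregation can stay in SQL while linking and provenance are handled on the RDF side.
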
  38. Fin