Linked Census Data

A talk about the use of Linked Data in historical research on census data, with some slides about TabLinker (http://github.com/Data2Semantics/TabLinker). Part of the Data2Semantics project (http://data2semantics.org).

Linked Census Data Presentation Transcript

  • 1. Linked Census Data
    Rinke Hoekstra
    CEDAR Kickoff, Thursday 26 January 2012
  • 2. Overview
    “Can Linked Data make a difference for historical analysis?”
    Problem
    Procedure (as I understand it)
    Step-by-step
    Vocabularies, tools
    Conclusion
  • 3. Problem
    ~519 Excel spreadsheets (more? I heard 1200)
    Want to do analysis over time and space, but...
    Structure: Excel sheets cannot be readily imported into a database
    Contents: Excel sheets are neither normalised (age) nor harmonised (occupations/places)
    Contents: Excel sheets contain errors (both original and data-entry)
    Want to preserve all stages of data cleansing/harmonisation
  • 4. Procedure
    Stages (left column of the slide): Archiving; Correcting/Documenting; Interpreting; Normalising; Harmonising; Visualising
    Actions (right column of the slide, in order):
    Verbatim import of sheets to database/triple store
    Add missing information (headers)
    Add corrected information (data)
    Interpret and correct objective information
    Link information across sheets
    Link information to other datasets (e.g. locations)
    Build (generic) visualisations of results
  • 5. ... a bit about Linked Data
    “Just another Data Model”
    RDF ≠ Ontology (OWL)
    RDF ≠ Taxonomy (RDFS/SKOS)
    Globally Unique Identifiers (URI) for all entities
    Dereferenceable on the Web (URI = URL)
    HTTP-accessible databases (triple stores, SPARQL)
    Triples all the way: <subject, predicate, object>
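The "triples all the way" idea on slide 5 can be illustrated with a very small in-memory sketch. This is not a real triple store and not SPARQL: it only shows that every statement is a <subject, predicate, object> tuple and that a query is a pattern with wildcards. The `http://example.com/` URIs and the `DataCell`, `RowHeader` and `populationSize` names under that base are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical base URI and example data, for illustration only.
EX = "http://example.com/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

graph: Set[Triple] = {
    Triple(EX + "workbook1/sheet1/D15", RDF_TYPE, EX + "DataCell"),
    Triple(EX + "workbook1/sheet1/D15", EX + "populationSize", "1"),
    Triple(EX + "workbook1/sheet1/B8", RDF_TYPE, EX + "RowHeader"),
}


def match(
    graph: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in graph:
        if (
            (s is None or t.subject == s)
            and (p is None or t.predicate == p)
            and (o is None or t.obj == o)
        ):
            yield t


# "Which subjects are typed as data cells?"
data_cells = sorted(t.subject for t in match(graph, p=RDF_TYPE, o=EX + "DataCell"))
print(data_cells)
```

A real deployment would use a triple store and SPARQL for the same pattern-matching step; the sketch only makes the data model concrete.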
  • 6. Spreadsheet ≠ Database
    Primary keys are entities
    Column names are attributes
    Cell values are attribute values
    Secondary keys are relations to other entities
  • 9. Spreadsheet ≠ Database
    No primary keys! Anything can be an entity
    Column headers are “types”
    Row headers are “types”. Hierarchies!
    Cell values are entity “values”
    No relations to other entities
  • 10. Anatomy of a Spreadsheet
    [diagram: a Workbook contains Sheets; each Sheet contains Cells]
  • 11. Anatomy of a Spreadsheet
    [diagram: Workbook1.xls contains Sheet1 and Sheet2; cells are identified per sheet as Sheet1:A1, Sheet1:B1, Sheet1:C1, Sheet1:A2, ... and Sheet2:A1, Sheet2:B1, Sheet2:C1, Sheet2:A2, ...]
  • 12. Anatomy of a Spreadsheet
    [diagram: Workbook1.xls with cell contents; Sheet1 shows "workers" with "agriculture" 12 and "industry" 6; Sheet2 shows "diamond cutters" with "A" 34 and "B" 67]
  • 13. Anatomy of a Spreadsheet
    [diagram: same workbook contents as slide 12]
    NB: all URIs scoped to sheet!
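The note that all URIs are scoped to their sheet can be sketched as a small URI-minting helper. The base URI and both function names are assumptions for illustration; the slides do not show how TabLinker actually builds its URIs.

```python
from urllib.parse import quote

# Hypothetical base URI; not taken from the slides.
BASE = "http://example.com/"


def cell_uri(workbook: str, sheet: str, cell: str) -> str:
    """URI for a cell address such as A1, scoped to its workbook and sheet."""
    return f"{BASE}{quote(workbook)}/{quote(sheet)}/{quote(cell)}"


def label_uri(workbook: str, sheet: str, label: str) -> str:
    """URI for a header label; the same label on two sheets gives two URIs."""
    slug = label.strip().replace(" ", "_")
    return f"{BASE}{quote(workbook)}/{quote(sheet)}/{quote(slug)}"


print(cell_uri("Workbook1.xls", "Sheet1", "A1"))
print(label_uri("Workbook1.xls", "Sheet1", "workers"))
print(label_uri("Workbook1.xls", "Sheet2", "workers"))
```

Because the sheet is part of the identifier, a label such as "workers" on Sheet1 is not silently merged with the same label on Sheet2; linking the two becomes an explicit harmonisation step.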
  • 14. Data Cube
    How to best represent numeric data, in a flexible way?
    SDMX (Eurostat, World Bank, CBS, etc.)
    Every data item is an observation
    Every observation has a value
    Every observation has one or more dimensions
  • 16. Data Cube
    [same text as slide 14, overlaid with an example observation from a census sheet. Values shown: 12, 1878, M, O, I, E, pannenbakkers (roof-tile makers), D, 1. Dimension labels shown: leeftijd (age), geboortejaar (year of birth), geslacht (sex), huwelijkse staat (marital status), nummer der beroepsklasse (number of the occupational class), letter der beroepsklasse (letter of the occupational class), beroep (occupation), positie (position)]
  • 17. Data Cube
    [same example as slide 16, with question marks added next to several of the dimension labels]
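The Data Cube view described on these slides (every data item is an observation with one value and one or more dimensions) can be sketched as a plain data structure. The dimension names are the Dutch labels from the slide; which value belongs to which dimension, the value "V", and all counts below are assumptions for illustration, not data from the census.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Observation:
    value: int                  # the measured value, e.g. a population count
    dimensions: Dict[str, str]  # dimension name -> dimension value


# Hypothetical observations; counts and pairings are made up.
observations: List[Observation] = [
    Observation(1, {"geslacht": "M", "positie": "D", "beroep": "pannenbakkers"}),
    Observation(4, {"geslacht": "M", "positie": "A", "beroep": "pannenbakkers"}),
    Observation(2, {"geslacht": "V", "positie": "D", "beroep": "pannenbakkers"}),
]


def total(observations: List[Observation], **fixed: str) -> int:
    """Sum the values of all observations whose dimensions match `fixed`."""
    return sum(
        o.value
        for o in observations
        if all(o.dimensions.get(name) == wanted for name, wanted in fixed.items())
    )


print(total(observations, geslacht="M"))  # slice on one dimension
print(total(observations, positie="D"))   # slice on another dimension
print(total(observations))                # no dimension fixed: grand total
```

The flexibility the slide asks for comes from treating dimensions as an open set of key/value pairs: a sheet with extra headers simply produces observations with more dimensions.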
  • 18. Anatomy of a Spreadsheet
    [annotated spreadsheet: cell regions marked as Properties, Headers, RowHeaders and Data]
  • 20. Anatomy of a Spreadsheet
    [annotated spreadsheet: cell regions marked as Properties, Headers, RowHeaders and Data]
    http://github.com/Data2Semantics/TabLinker
  • 21. [RDF graph diagram for data cell Sheet1:D15]
    Observation node: _:x, with d2s:populationSize "1"^^xsd:int
    Dimension values (linked via d2s:dimension): :14--15_1875--1874, :M, :O, :D, Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers
    Header concepts (linked via skos:broader): :I, :I/E, :Nummer_der_beroepsklasse, :Letter__Onderdeel_beroepsklasse_, :BENAMING_van_de_onderdeelen_der_onderscheidene_beroepsklassen__met_de_daartoe_behoorende_beroepen, :Positie_in_het_beroep__aangeduid_met_A__B__C_of_D
  • 22. [same RDF graph as slide 21, extended with the source cells and their types]
    Cell types (rdf:type): d2s:HierarchicalRowHeader, d2s:RowHeader, d2s:Header, d2s:DataCell, d2s:Metadata
    Source cells: Sheet1:B8, Sheet1:C14, Sheet1:D15, Sheet1:E15, Sheet1:F15, Sheet1:L3, Sheet1:L4, Sheet1:L5, Sheet1:L6, Sheet1:L15
    Links from cells to the graph: d2s:isDimension, d2s:isObservation
    Additional nodes: :Regelnummer, :5, :10
  • 23. What TabLinker can’t do
    Annotations (“footnote”-style, on a separate sheet)
    Interpret functions (e.g. automatic sums)
    Integrate/harmonise across sheets/files
    Additional useful functionality: “checksum” functionality; export to database tables
  • 24. Normalising & Correcting
    [RDF graph: observation _:x with d2s:populationSize "1"^^xsd:int and d2s:dimension :14--15_1875--1874]
  • 25. Normalising & Correcting
    [RDF graph showing the same observation in successive versions]
    Original: d2s:populationSize "1"^^xsd:int, d2s:dimension :14--15_1875--1874
    Normalised and corrected: d2s:populationSize "11"^^xsd:int, d2s:censusYear "1889"^^xsd:int, d2s:birthYears :1875--1874, d2s:gemeente :Assendelft, d2s:ageGroup :14-15
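The normalising step on slide 25 splits the compound header value :14--15_1875--1874 into an age group (:14-15) and birth years (:1875--1874). A minimal sketch of that split follows; the function names and the regular expression are assumptions, and the consistency check is only one possible reading of the “checksum” idea mentioned on slide 23.

```python
import re

# Matches local names such as "14--15_1875--1874" (ages, then birth years).
COMPOUND = re.compile(r"^(\d+)--(\d+)_(\d{4})--(\d{4})$")


def split_age_dimension(local_name: str) -> dict:
    """Split a compound age/birth-year header into two separate dimensions."""
    m = COMPOUND.match(local_name)
    if m is None:
        raise ValueError(f"not a compound age/birth-year value: {local_name!r}")
    age_from, age_to, year_a, year_b = m.groups()
    return {
        "ageGroup": f"{age_from}-{age_to}",
        "birthYears": f"{year_a}--{year_b}",
    }


def birth_years_consistent(local_name: str, census_year: int) -> bool:
    """Check that census year minus each age gives the stated birth years."""
    m = COMPOUND.match(local_name)
    if m is None:
        return False
    age_from, age_to, year_a, year_b = (int(g) for g in m.groups())
    return {census_year - age_from, census_year - age_to} == {year_a, year_b}


print(split_age_dimension("14--15_1875--1874"))
print(birth_years_consistent("14--15_1875--1874", census_year=1889))
```

With the census year 1889 from the slide, ages 14 and 15 give birth years 1875 and 1874, so the header is internally consistent; a mismatch would flag a likely data-entry error.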
  • 26. Documenting
    [RDF graph: the original and corrected observations from slide 25 placed in two named graphs, with provenance attached]
    Named graphs: <http://example.com/workbook1/sheet1> (original) and <http://example.com/workbook1/sheet1/corrected>
    Activity: :curation20120126, rdf:type provo:Activity, linked via provo:wasGeneratedBy
    Agent: provo:hadAgent :RinkeHoekstra
    Timing: provo:startedAt and provo:endedAt point to _:a and _:b, each with a time:inXSDDateTime value ("20120126T08:30:00" and "20120126T09:00:00")
    http://www.w3.org/TR/prov-o/
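The provenance pattern on slide 26 can be sketched as a small record that emits triples. The property names are the ones shown on the slide; the class name, the flattening of the intermediate time nodes into plain timestamps, and the assumption that 08:30 is the start and 09:00 the end are illustrative choices, not taken from the source.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

STAMP = "%Y%m%dT%H:%M:%S"  # timestamp layout used on the slide


@dataclass
class CurationActivity:
    uri: str
    agent: str
    started_at: datetime
    ended_at: datetime
    generated_graph: str

    def triples(self) -> List[Tuple[str, str, str]]:
        """Provenance statements as (subject, predicate, object) tuples."""
        return [
            (self.uri, "rdf:type", "provo:Activity"),
            (self.uri, "provo:hadAgent", self.agent),
            (self.uri, "provo:startedAt", self.started_at.isoformat()),
            (self.uri, "provo:endedAt", self.ended_at.isoformat()),
            (self.generated_graph, "provo:wasGeneratedBy", self.uri),
        ]


activity = CurationActivity(
    uri=":curation20120126",
    agent=":RinkeHoekstra",
    # Assumption: the earlier timestamp is the start, the later one the end.
    started_at=datetime.strptime("20120126T08:30:00", STAMP),
    ended_at=datetime.strptime("20120126T09:00:00", STAMP),
    generated_graph="<http://example.com/workbook1/sheet1/corrected>",
)

duration_minutes = (activity.ended_at - activity.started_at).total_seconds() / 60
for triple in activity.triples():
    print(triple)
```

Attaching the activity to the corrected graph rather than to individual cells keeps the original graph untouched, which matches the goal of preserving every stage of cleansing.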
  • 27. Harmonising
    [SKOS hierarchy diagram: occupational class I at the top; subclasses D, E and A each skos:broader I; four occupation groups each skos:broader one of the subclasses]
    Occupation groups:
    Fabricage van dakpannen (pannenbakkers)
    Fabricage van steen (molensteen, steenbakkers, tegelbakkers)
    Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)
    Fabricage van kalk
  • 28. Harmonising
    [same SKOS hierarchy as slide 27, with each occupation group mapped to HISCO codes]
    Mapping properties used: skos:exactMatch, skos:closeMatch, skos:broadMatch
    HISCO codes shown: HISCO:23810, HISCO:23811, HISCO:25281, HISCO:26340, HISCO:26345
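Once sheet-scoped occupation concepts are mapped to a shared code list such as HISCO, counts from different sheets can be combined. The sketch below only follows exact-match style mappings. Every identifier and count in it is a placeholder: the codes are deliberately not real HISCO codes, because the pairing between occupation groups and codes cannot be recovered from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Placeholder mappings (concept URI -> shared code); not real HISCO codes.
exact_match: Dict[str, str] = {
    "1889/Sheet1:Fabricage_van_dakpannen": "HISCO:placeholder-1",
    "1899/Sheet1:Fabricage_van_dakpannen": "HISCO:placeholder-1",
    "1889/Sheet1:Fabricage_van_kalk": "HISCO:placeholder-2",
    "1899/Sheet1:Fabricage_van_kalk": "HISCO:placeholder-2",
}

# Made-up counts: (census year, sheet-scoped concept, population size).
counts: List[Tuple[int, str, int]] = [
    (1889, "1889/Sheet1:Fabricage_van_dakpannen", 11),
    (1899, "1899/Sheet1:Fabricage_van_dakpannen", 14),
    (1889, "1889/Sheet1:Fabricage_van_kalk", 3),
    (1899, "1899/Sheet1:Fabricage_van_kalk", 5),
    (1899, "1899/Sheet1:Unmapped_group", 7),
]


def harmonise(
    counts: List[Tuple[int, str, int]], mapping: Dict[str, str]
) -> Dict[Tuple[str, int], int]:
    """Total the counts per (shared code, year); unmapped concepts are skipped."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for year, concept, size in counts:
        code = mapping.get(concept)
        if code is not None:
            totals[(code, year)] += size
    return dict(totals)


series = harmonise(counts, exact_match)
print(series)
```

Weaker mappings such as skos:closeMatch or skos:broadMatch would need an explicit decision about whether their counts may be added together, which is why the sketch leaves them out.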
  • 29. Harmonising
    [same SKOS hierarchy as slide 27, shown twice: once with plain labels and once with every node prefixed by its sheet of origin, e.g. Sheet1:I, Sheet1:D, Sheet1:E, Sheet1:A, Sheet1:Fabricage van dakpannen (pannenbakkers), Sheet1:Fabricage van kalk]
  • 30. [two SKOS hierarchies side by side, one for the 1889 census and one for the 1899 census, with mapping links between corresponding occupation groups]
    1889: I; D, E, A; Fabricage van dakpannen (pannenbakkers); Fabricage van steen (molensteen, steenbakkers, tegelbakkers); Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.); Fabricage van kalk
    1899: I; D, E, A; Fabricage van dakpannen (pannenbakkers); Fabricage van steen (steenbakkers, tegelbakkers); Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.); Fabricage van kalk
    Mapping properties between the two years: skos:exactMatch, skos:closeMatch, skos:narrowMatch
  • 31. Is SKOS sufficient?
    [same 1889 and 1899 hierarchies and mapping links as slide 30]
    NB: These are not strings, but globally unique URIs, scoped within their spreadsheet (graph!) of origin.
  • 34. Vocabularies, Tools
    Vocabularies: Data Cube, SKOS, W3C Time, PROV-O
    Excel + TabLinker: semi-automatic conversion of Excel sheets to RDF
    ProvTracer: create a PROV-O provenance trail for shell/python scripts
    Visualization: prototype SGVizler (SPARQL + Google Graph API)
  • 35. Discussion
    Advantages of the Linked Data approach:
    Straightforward transformation from spreadsheets
    Seamless integration of original, corrected and harmonised data
    Ingestion of external (linked) data
    Powerful documentation (provenance)
    Everything is transparently query-able (SPARQL) ... on the Web
  • 36. Discussion
    Disadvantages of the Linked Data approach (subject to research):
    Size? (300k * 519 sheets = 156M triples)
    Only rudimentary support for arithmetical operations in queries
    No dynamic/conditional ‘view’-like graphs
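The size estimate on slide 36 is simple arithmetic and can be checked directly. The figure of 300k triples per sheet is the slide's own; applying it to the 1200 sheets mentioned on slide 3 is an extrapolation added here, assuming the same average per sheet.

```python
TRIPLES_PER_SHEET = 300_000  # per-sheet figure from the slide


def estimated_triples(sheets: int, per_sheet: int = TRIPLES_PER_SHEET) -> int:
    """Rough total triple count for a given number of sheets."""
    return sheets * per_sheet


known = estimated_triples(519)       # sheet count from slide 3
possible = estimated_triples(1200)   # the "I heard 1200" figure from slide 3

print(f"{known:,} triples, about {round(known / 1e6)}M")
print(f"{possible:,} triples, about {round(possible / 1e6)}M")
```

The exact product for 519 sheets is 155.7 million, which the slide rounds to 156M.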
  • 37. SPARQL vs. SQL?
    Middle ground? Expose database through D2RQ
  • 38. Fin