Semantic Representations for Research

Transcript

  • 1. Semantic Representations for Research Rinke Hoekstra and Stefan Schlobach VU University Amsterdam/University of Amsterdam http://www.data2semantics.org
  • 2. About us...• Knowledge Representation and Reasoning Group (Frank van Harmelen)• Modeling of complex domains• Querying and reasoning over these models• ... at a very large scale (the Web)
  • 3. About us...• Knowledge Representation and Reasoning Group (Frank van Harmelen)• Experience with, among others, CATCH, STICH, LarKC, CEDAR and Data2Semantics• Premier group for provenance and linked data at scale
  • 4. Overview• Research Lifecycle (Data2Semantics and LarKC)• Historical Census Data (CEDAR and Data2Semantics)• Short Title Catalogue of The Netherlands (STCN) (Inger Leemans, Fernie Maas, Paul Huygen, Albert Meroño-Peñuela)
  • 5. How to share, publish, access, analyse, interpret and reuse data?• Increase the ease of sharing scientific data ...• ... of accessing, analysing and interpreting data ...• ... and thereby increase the reuse of data
  • 6. EASY Data Repository• Enrich datasets: census data
  • 7. EASY Data Repository• Enrich datasets: census data• Large volumes of publications• Improve services to clients• Automated services
  • 8. EASY Data Repository• Enrich datasets: census data• Large volumes of publications• Improve services to clients• Automated services• Build systems for hospitals
  • 9. EASY Data Repository• Enrich datasets: census data• Large volumes of publications• Improve services to clients• Automated services• Build systems for hospitals
  • 10. Linked Data• “Semantic Hyperlinks” between data items• Every data item has a global identifier ...• ... that looks like a web address (URI) ...• ... is linked and described using shared vocabularies• Resource Description Framework (RDF)• SPARQL query language & endpoint
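The Linked Data bullets above can be made concrete with a minimal sketch. It uses only the Python standard library rather than a real RDF store, and every identifier under http://example.org/ is a made-up assumption for illustration. Each data item is a URI, statements are subject-predicate-object triples, and a query is a pattern in which None acts like a SPARQL variable.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace, used for illustration only.
EX = "http://example.org/census/"
# Shared vocabulary terms from the RDF and RDFS namespaces.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Every data item has a global identifier (a URI); links are triples.
triples: List[Triple] = [
    (EX + "municipality/Assendelft", RDF_TYPE, EX + "Municipality"),
    (EX + "municipality/Assendelft", RDFS_LABEL, "Assendelft"),
    (EX + "table/1889", EX + "covers", EX + "municipality/Assendelft"),
]


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None is a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?label WHERE { ?s rdfs:label ?label }
labels = [o for _, _, o in match(p=RDFS_LABEL)]
print(labels)
```

A real deployment would expose the same kind of pattern matching through a SPARQL endpoint instead of an in-memory list.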
  • 11. Linked Data [Figure: Linked Data cloud diagram as of September 2011, showing interlinked datasets (for example DBpedia, Freebase, MusicBrainz, UniProt, data.gov.uk) grouped by domain: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences]• “Semantic Hyperlinks” between data items• Every data item has a global identifier ...• ... that looks like a web address (URI) ...• ... is linked and described using shared vocabularies• Resource Description Framework (RDF)• SPARQL query language & endpoint
  • 12. Research Lifecycle [Figure: research lifecycle around the Linked Data Cloud, with stages Semi-Automatic Annotation (acquiring data from text?), Semi-Automatic Conversion, RDF Conversion, RDF Cleaning, Internal Linking, Link to Other Data, Graph Rewriting, Provenance Enrichment, Provenance Tracking, Querying and Ranking, Analysis and Cloud Metrics, Visualization, User Interfaces and RDF Feedback; tools named: GATE, OpenCalais, Amalgame, SILK, xml2rdf, d2rq, rdb2rdf, “tablinker”]
  • 13. Challenges• Build useful services and tools for data publishers ...• ... that maintain provenance information ...• ... and cater for the entire research cycle ...• ... including a feedback loop to new research
  • 14. Challenges• Build useful services and tools for data publishers ...• ... that maintain provenance information ...• ... and cater for the entire research cycle ...• ... including a feedback loop to new research
  • 15. Large Knowledge Collider• Data analysis pipeline• Custom workflows• Highly scalable• Query driven• Exposed as SPARQL endpoint
  • 16. Historical Census Data• Gathered from 1795 to 1971• Demographics, houses, occupations
  • 17. Historical Census Data• Gathered from 1795 to 1971• Demographics, houses, occupations
  • 18. Historical Census Data• Gathered from 1795 to 1971• Demographics, houses, occupations
  • 19. Historical Census Data• Gathered from 1795 to 1971• Demographics, houses, occupations• 507 Excel files• 2288 tables• 33283 annotations
  • 20. Annotations• Created at data entry time• Created as we speak• Corrections to original census tables• Corrections to Excel version of census table• Any additional remarks...
  • 21. Harmonization?• Enable historical research across census years• Query across multiple heterogeneous datasets• Accommodate multiple interpretations
  • 22. Harmonization• Overcome structural heterogeneity• Overcome semantic heterogeneity • Different categories (age groups, locations) • Different values (names of religions, municipalities)
  • 23. Current Situation• Iterative refinement of MySQL database tables• Harmonization against existing codifications• Expensive manual process• Loss of information between harmonization steps• Loss of detail in mapping to existing codification• Not repeatable
  • 24. Requirements• (Semi-)automatic conversion and harmonization• Repeatable• Conservation of information (only add)• Provenance (who did what)• Flexible model• Linking to other datasets• Publish as open data
  • 25. Research Cycle [Figure: research lifecycle around the Linked Data Cloud, with stages Semi-Automatic Annotation (acquiring data from text?), Semi-Automatic Conversion, RDF Conversion, RDF Cleaning, Internal Linking, Link to Other Data, Graph Rewriting, Provenance Enrichment, Provenance Tracking, Querying and Ranking, Analysis and Cloud Metrics, Visualization, User Interfaces and RDF Feedback; tools named: GATE, OpenCalais, Amalgame, SILK, xml2rdf, d2rq, rdb2rdf, “tablinker”]
  • 26. TabLinker http://github.com/Data2Semantics/TabLinker
  • 27. TabLinker http://github.com/Data2Semantics/TabLinker
  • 28. TabLinker http://github.com/Data2Semantics/TabLinker
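  • 29. TabLinker http://github.com/Data2Semantics/TabLinker [Figure: annotated census table. Marked-up cells include leeftijd (age), geboortejaar (birth year), geslacht (sex), huwelijkse staat (marital status), nummer der beroepsklasse (number of the occupational class), letter der beroepsklasse (letter of the occupational class), beroep (occupation, here pannenbakkers, roof-tile makers) and positie (position). The labels leeftijd, nummer der beroepsklasse, geboortejaar and geslacht are each followed by "?" on the slide. Values visible: 12, 1878, M, O, I, E, D, 1]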
  • 30. TabLinker• Verbatim graph representation of spreadsheet• Separate layer for semantics of spreadsheet• Separate graphs for any annotations, interpretations and harmonizations of the underlying data• Round-tripping from Excel to RDF and back
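The idea of a verbatim graph representation with round-tripping can be sketched as follows. This is not TabLinker's actual code: it is a standard-library illustration under stated assumptions. The cell identifiers follow the Sheet1:E15 pattern and the types d2s:Header and d2s:DataCell appear in the slides, while the property d2s:value, the rule that the first row is a header, and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def column_letter(index: int) -> str:
    """Convert a 0-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def sheet_to_triples(sheet: str, rows: List[List[str]],
                     header_rows: int = 1) -> List[Triple]:
    """Emit one node per non-empty cell, keeping its value verbatim."""
    triples: List[Triple] = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value == "":
                continue
            cell = f"{sheet}:{column_letter(c)}{r + 1}"
            # Assumption: the first `header_rows` rows are headers.
            kind = "d2s:Header" if r < header_rows else "d2s:DataCell"
            triples.append((cell, "rdf:type", kind))
            # d2s:value is a hypothetical property name.
            triples.append((cell, "d2s:value", value))
    return triples


def triples_to_cells(triples: List[Triple]) -> Dict[str, str]:
    """Recover cell contents from the graph (the way back to the spreadsheet)."""
    return {s: o for s, p, o in triples if p == "d2s:value"}


# Hypothetical sample rows.
rows = [["leeftijd", "geslacht"], ["14-15", "M"]]
graph = sheet_to_triples("Sheet1", rows)
cells = triples_to_cells(graph)
print(cells)
```

Because the cell layer is kept verbatim, annotations and interpretations can live in separate graphs that point at these cell identifiers without altering them.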
  • 31. [Figure: spreadsheet cells as graph nodes: Sheet1:E15, Sheet1:C14, Sheet1:B8, Sheet1:L15, Sheet1:L3, Sheet1:L4, Sheet1:L5, Sheet1:F15, Sheet1:D15, Sheet1:L6]
  • 32. [Figure: the same cell nodes, each linked via rdf:type to one of d2s:HierarchicalRowHeader, d2s:DataCell, d2s:Header, d2s:RowHeader or d2s:Metadata]
  • 33. [Figure: the typed cell nodes, additionally linked via d2s:isDimension and d2s:isObservation to a separate layer containing :I, :I/E, :M, :O, :D, :5, :10, :14--15_1875--1874, Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers and an observation node _:x]
  • 34. [Figure: the same graph, additionally showing d2s:populationSize "1"^^xsd:int, d2s:dimension and skos:broader links, and the dimension resources :Nummer_der_beroepsklasse, :Letter__Onderdeel_beroepsklasse_, :BENAMING_van_de_onderdeelen_der_onderscheidene_beroepsklassen__met_de_daartoe_behoorende_beroepen, :Regelnummer and :Positie_in_het_beroep__aangeduid_met_A__B__C_of_D]
  • 35. Harmonization within a year [Figure: category I with subcategories D, E and A, linked by skos:broader to the occupation groups "Fabricage van dakpannen (pannenbakkers)", "Fabricage van steen (molensteen, steenbakkers, tegelbakkers)", "Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)" and "Fabricage van kalk"; shown once with plain labels and once with Sheet1: identifiers]
  • 36. Harmonization across years [Figure: the 1889 hierarchy (I; D, E, A; "Fabricage van dakpannen (pannenbakkers)", "Fabricage van steen (molensteen, steenbakkers, tegelbakkers)", "Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)", "Fabricage van kalk") and the 1899 hierarchy (I; D, E, A; "Fabricage van dakpannen (pannenbakkers)", "Fabricage van steen (steenbakkers, tegelbakkers)", "Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.)", "Fabricage van kalk"), each built with skos:broader and connected to each other by skos:narrowMatch, skos:closeMatch, skos:exactMatch and skos:narrowMatch links]
  • 37. Harmonization external linking [Figure: the same skos:broader hierarchy (I; D, E, A; "Fabricage van dakpannen (pannenbakkers)", "Fabricage van steen (molensteen, steenbakkers, tegelbakkers)", "Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)", "Fabricage van kalk") linked by skos:exactMatch, skos:broadMatch and skos:closeMatch to the codes HISCO:23811, HISCO:25281, HISCO:26345, HISCO:23810 and HISCO:26340]• HISCO: Historical International Standard Classification of Occupations
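The three harmonization slides suggest that mappings are kept apart from the data and can be chosen per query. A minimal sketch of that idea follows, assuming made-up counts, made-up category identifiers and a made-up shared concept shared:tile_makers; none of these values come from the census data or from HISCO. Only the SKOS relation names are taken from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical observations: (year-specific category, population count).
observations: List[Tuple[str, int]] = [
    ("1889:pannenbakkers", 10),
    ("1899:pannenbakkers", 12),
    ("1899:steenbakkers", 7),
]

# Hypothetical mappings, stored separately so the source data is never changed.
mappings: List[Tuple[str, str, str]] = [
    ("1889:pannenbakkers", "skos:exactMatch", "shared:tile_makers"),
    ("1899:pannenbakkers", "skos:exactMatch", "shared:tile_makers"),
    ("1899:steenbakkers", "skos:closeMatch", "shared:tile_makers"),
]


def totals(accepted: Set[str]) -> Dict[str, int]:
    """Sum counts per shared concept, using only the accepted mapping relations."""
    target = {s: o for s, p, o in mappings if p in accepted}
    result: Dict[str, int] = defaultdict(int)
    for category, count in observations:
        if category in target:
            result[target[category]] += count
    return dict(result)


# Two interpretations of the same data.
strict = totals({"skos:exactMatch"})
loose = totals({"skos:exactMatch", "skos:closeMatch"})
print(strict, loose)
```

The set of accepted relations acts as a crude selector for an interpretation: the same observations yield different totals depending on which mappings a researcher is willing to trust.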
  • 38. Curation & Annotation [Figure: two graphs, <http://example.com/workbook1/sheet1> and <http://example.com/workbook1/sheet1/corrected>. Nodes and properties shown: an observation _:x with d2s:censusYear "1889"^^xsd:int, d2s:gemeente :Assendelft, d2s:ageGroup :14-15, d2s:birthYears :1875--1874 and d2s:dimension :14--15_1875--1874; d2s:populationSize values "1"^^xsd:int and "11"^^xsd:int; an activity :curation20120126 of rdf:type provo:Activity, with provo:wasGeneratedBy, provo:hadAgent :RinkeHoekstra, and provo:startedAt and provo:endedAt pointing to _:a and _:b, which carry time:inXSDDateTime "20120126T08:30:00" and "20120126T09:00:00"]
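A correction that only adds information and records who made it could be sketched like this. The graph names, the activity name, the agent and the provo: property names are taken from the slide; the reading that "1" is the original value, "11" the corrected one, and that the activity ran from 08:30 to 09:00 is an assumption, as is the whole data layout.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

ORIGINAL = "http://example.com/workbook1/sheet1"
CORRECTED = "http://example.com/workbook1/sheet1/corrected"

# Assumption: "1" is the value found in the original sheet.
graphs: Dict[str, List[Triple]] = {
    ORIGINAL: [("_:x", "d2s:populationSize", "1")],
}
provenance: List[Triple] = []


def add_correction(subject: str, predicate: str, new_value: str,
                   activity: str, agent: str, started: str, ended: str) -> None:
    """Store the corrected value in its own graph and describe the activity."""
    graphs.setdefault(CORRECTED, []).append((subject, predicate, new_value))
    provenance.extend([
        (CORRECTED, "provo:wasGeneratedBy", activity),
        (activity, "rdf:type", "provo:Activity"),
        (activity, "provo:hadAgent", agent),
        (activity, "provo:startedAt", started),
        (activity, "provo:endedAt", ended),
    ])


def value(subject: str, predicate: str,
          prefer_corrected: bool = True) -> Optional[str]:
    """Look up a value, optionally letting the corrected graph take precedence."""
    order = [CORRECTED, ORIGINAL] if prefer_corrected else [ORIGINAL]
    for name in order:
        for s, p, o in graphs.get(name, []):
            if (s, p) == (subject, predicate):
                return o
    return None


# Assumed direction of the correction and assumed start/end order.
add_correction("_:x", "d2s:populationSize", "11",
               ":curation20120126", ":RinkeHoekstra",
               "20120126T08:30:00", "20120126T09:00:00")
print(value("_:x", "d2s:populationSize"),
      value("_:x", "d2s:populationSize", prefer_corrected=False))
```

The original graph is never modified, so both readings stay available and the provenance triples say where the corrected one came from.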
  • 39. Open Issues• Create the necessary mappings between graphs ... this is historical research• Mappings are interpretations• Query within a specified interpretation space• How to reliably perform statistical analysis across mappings?• How to study concept drift across years?
  • 40. Short Title Catalogue• All books published in NL until 1800• Digitized over a period of 30 years• 139817 publications (KB says >190000)• 9962 publishers• 23627 authors• 96024 links to scanned title pages
  • 41. Redactiebladen (editorial sheets)• Redactiebladen• PPN identifiers• KMC codes
  • 42. Requirements• (Semi-)automatic conversion and harmonization• Repeatable• Conservation of information (only add)• Provenance (who did what)• Flexible model• Linking to other datasets• Publish as open data
  • 43. Research Cycle [Figure: research lifecycle around the Linked Data Cloud, with stages Semi-Automatic Annotation (acquiring data from text?), Semi-Automatic Conversion, RDF Conversion, RDF Cleaning, Internal Linking, Link to Other Data, Graph Rewriting, Provenance Enrichment, Provenance Tracking, Querying and Ranking, Analysis and Cloud Metrics, Visualization, User Interfaces and RDF Feedback; tools named: GATE, OpenCalais, Amalgame, SILK, xml2rdf, d2rq, rdb2rdf, “tablinker”]
  • 44. Procedure• Convert to MySQL database (Paul Huygen)• Specify mapping to RDF (D2RQ mapping language)• Interlink with other data sources (Bibliografish portaal, Rijksmuseum, Iconclass, Ecartico)• Publish as browsable and queryable dataset (http://stcn.data2semantics.org)
  • 45. Procedure• Convert to MySQL database ✓ (Paul Huygen)• Specify mapping to RDF ✓ (D2RQ mapping language)• Interlink with other data sources (Bibliografish portaal, Rijksmuseum, Iconclass, Ecartico)• Publish as browsable and queryable dataset ✓ (http://stcn.data2semantics.org)
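The database-to-RDF step of the procedure can be illustrated with a small stand-in. The project used MySQL and the D2RQ mapping language; the sketch below instead uses Python's built-in sqlite3 module and a hand-written mapping function, purely so that it runs without extra software. The table name, column names and the sample row are hypothetical; only the URI prefix follows the pattern of the published STCN resources.

```python
import sqlite3
from typing import Iterator

# URI pattern as seen in the published dataset.
BASE = "http://stcn.data2semantics.org/resource/publicatie/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def rows_to_ntriples(conn: sqlite3.Connection) -> Iterator[str]:
    """Map each row of the (hypothetical) publication table to one N-Triples line."""
    query = "SELECT ppn, title FROM publication ORDER BY ppn"
    for ppn, title in conn.execute(query):
        # Minimal escaping: backslashes and double quotes only.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        yield f'<{BASE}{ppn}> <{RDFS_LABEL}> "{escaped}" .'


# In-memory stand-in for the MySQL database; the row is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publication (ppn TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO publication VALUES (?, ?)",
             ("000000000", 'An "example" title'))

lines = list(rows_to_ntriples(conn))
print("\n".join(lines))
```

A declarative mapping language such as D2RQ expresses the same row-to-resource rule as configuration, which keeps the conversion repeatable.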
  • 46. http://stcn.data2semantics.org/resource/publicatie/337778825
  • 47. Fingerprints [Figure: author STCN:auteur/070082960 with rdfs:label "Wilhelmus Nakatenus S.J. (1617-1682)", linked via stcn:publicatie to STCN:publicatie/336280211 and STCN:publicatie/314125434; the publications carry stcn:titeluitgave, stcn:illustratie and stcn:vingerafdruk links; the fingerprints STCN:vingerafdruk/27 and STCN:vingerafdruk/1207 are connected by skos:exactMatch; labels shown: "Hemels palm-hof, ofte Groot getyde-boek" and "000012 - *b1 A4 ella : b2 2C7 ns$in"]
  • 48. Co-authors betweenness centrality (Gephi)
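For readers unfamiliar with the measure, betweenness centrality counts how often a node lies on shortest paths between other nodes. The slide used Gephi; the sketch below is an independent standard-library implementation of Brandes' algorithm for unweighted, undirected graphs, with unnormalized scores, and the three-author network is hypothetical.

```python
from collections import deque
from typing import Dict, List


def betweenness(graph: Dict[str, List[str]]) -> Dict[str, float]:
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        stack: List[str] = []
        preds: Dict[str, List[str]] = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths
        dist = {v: -1 for v in graph}   # -1 means not yet visited
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical co-author network: author_b published with both others.
coauthors = {
    "author_a": ["author_b"],
    "author_b": ["author_a", "author_c"],
    "author_c": ["author_b"],
}
scores = betweenness(coauthors)
print(scores)
```

In this toy network only author_b sits between the other two, so it is the only node with a non-zero score.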
  • 49. Summary• We use a highly flexible modeling framework that ...• ... allows for rapid data publication and integration ...• ... is extensible and distributed (DB = Web) ...• ... allows for co-existing diverging interpretations ...• ... adheres to the law of conservation of information ...• ... offers existing methods for capturing provenance ...• ... allows for a closed loop research cycle.