Building a structured catalog for educational datasets
Stefan Dietze
04/07/13
Linked Open (educational) Data
• LOD: 300+ datasets, 32 billion distinct RDF statements
• DataHub: 6000+ open datasets
• LinkedUp: FP7-ICT-2012-8, CSA (http://linkedup-project.eu)
• Goal: enabling large-scale take-up of (Linked) Open Data (education as application context)
Example dataset description: http://datahub.io/dataset/bbc (60,000,000 triples)
Using/exploiting Linked Data in education?
• Lack of reliable dataset metadata about:
  • Resource types
  • Topics & disciplines
  • Quality, currentness & availability
  • Provenance
• Lack of links and cross-dataset references
• Lack of scalable query methods
Linked Data "Observatory": Processing Chain
Processing steps:
• Endpoint Retrieval & Graph Extraction
• Schema Extraction and Mapping
• Sample Graph Extraction (per dataset)
• NER & NED (per resource)
• Interlinking & Co-Resolution (cross-dataset)
• Category Mapping, Normalisation, Filtering
Outputs: Dataset Catalog/Index; Links/Cross-references
Disambiguation example: rdfs:label "…ECB…" ? Candidate entities: dbpedia:European_Central_Bank, dbpedia:England-Wales-Cricket-Board; candidate categories: dbpedia:Finance, dbpedia:Sports
Dataset metadata (RDF/VoID):
• Schema mappings (types, properties)
• Entities & categories
• Topic relevance scores
• Availability, currentness data (tbc)
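A dataset-level description of the kind listed above could be serialised as VoID in Turtle. The following is a minimal sketch in plain Python, without an RDF library. The `void:` and `dcterms:` terms come from the published vocabularies; the title, the example class and the subject category are illustrative assumptions, and only the dataset URL and triple count are taken from the BBC example in these slides.

```python
from dataclasses import dataclass, field

PREFIXES = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
)


@dataclass
class DatasetDescription:
    """Hypothetical container for catalog metadata about one dataset."""
    uri: str
    title: str
    triples: int
    classes: list[str] = field(default_factory=list)   # types found in the data
    subjects: list[str] = field(default_factory=list)  # topic/category URIs

    def to_turtle(self) -> str:
        # Assumes the title contains no double quotes (no escaping is done).
        statements = [
            "a void:Dataset",
            f'dcterms:title "{self.title}"',
            f"void:triples {self.triples}",
        ]
        statements += [
            f"void:classPartition [ void:class <{c}> ]" for c in self.classes
        ]
        statements += [f"dcterms:subject <{s}>" for s in self.subjects]
        body = " ;\n    ".join(statements)
        return f"{PREFIXES}\n<{self.uri}>\n    {body} .\n"


bbc = DatasetDescription(
    uri="http://datahub.io/dataset/bbc",
    title="BBC",                                                  # assumed title
    triples=60_000_000,
    classes=["http://purl.org/ontology/po/Programme"],            # assumed class
    subjects=["http://dbpedia.org/resource/Category:Astronomy"],  # assumed topic
)
turtle = bbc.to_turtle()
print(turtle)
```

Topic relevance scores and availability data are left out here because the slides do not say which properties carry them.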
Goals:
• RDF catalog of datasets, i.e. a dataset of datasets (classification of datasets according to, e.g., represented types, disciplines/topics, data quality, accessibility)
• Links and coreferences => unified view on data => Linked Education Graph
• Infrastructure & APIs for federated queries
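One possible shape for a federated query over catalogued datasets uses the SERVICE keyword of SPARQL 1.1 Federated Query. The sketch below only builds the query string; the helper name `federated_query` and the endpoint URLs are hypothetical and not part of the LinkedUp infrastructure.

```python
def federated_query(endpoints: list[str], pattern: str, limit: int = 10) -> str:
    """Build a SPARQL 1.1 query that evaluates `pattern` at every endpoint
    through SERVICE and combines the per-endpoint results with UNION."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    blocks = [f"  {{ SERVICE <{ep}> {{ {pattern} }} }}" for ep in endpoints]
    union = "\n  UNION\n".join(blocks)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT * WHERE {{\n{union}\n}}\nLIMIT {limit}"
    )


# Hypothetical endpoints; in practice they would come from the dataset catalog.
query = federated_query(
    ["http://example.org/dataset-a/sparql", "http://example.org/dataset-b/sparql"],
    "?resource rdfs:label ?label",
    limit=5,
)
print(query)
```

The resulting string can be sent to any SPARQL 1.1 endpoint that supports federation; whether the remote services allow that is an assumption.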
Publications related to the processing chain:
[WEBSCI13] D’Aquin, M., Adamou, A., Dietze, S.: Assessing the Educational Linked Data Landscape. ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
[DEXA13] Nunes, B. P., Mera, A., Casanova, M. A., Fetahu, B., Paes Leme, L., Dietze, S.: Complex Matching of RDF Datatype Properties. 24th International Conference on Database and Expert Systems Applications (DEXA 2013), Prague, CR, August 2013.
[ESWC13] Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W.: Combining a co-occurrence-based and a semantic measure for entity linking. ESWC 2013, 10th Extended Semantic Web Conference, May 2013.
[ISWC13?] Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B. P.: Indexing of Linked Data, What’s all the data about. ISWC2013, 12th International Semantic Web Conference; under review.
[TKDE12] Demidova, E., Zhou, X., Nejdl, W.: A Probabilistic Scheme for Keyword-Based Incremental Query Construction. IEEE Transactions on Knowledge and Data Engineering, 24(3):426-439, 2012.
Challenge: data heterogeneity
Online Lecture:
<yov:Lecture8748720>
  <yov:title>Pluto & the Dwarf Planets</yov:title>
  …
</yov:Lecture8748720>
Lecture Slideset:
<ss:SlideSet-2139393292>
  <title>Planetary motion & gravity</title>
  …
</ss:SlideSet-2139393292>
Video Documentary:
<po:Programme519215>
  <po:Series>Wonders of the Solar System</po:Series>
  <po:Episode>Emp. of the Sun</po:Episode>
  <po:Actor>Brian Cox</po:Actor>
</po:Programme519215>
Open questions:
• Relatedness of resources/entities? (types, semantics)
• Metadata about datasets?
(D’Aquin et al., WebSci 2013; Nunes et al., ESWC 2013)
Data disambiguation, linking & annotation
Ambiguous mentions in the example resources:
• Online Lecture (yov:Lecture8748720), title "Pluto & the Dwarf Planets": Pluto?
• Video Documentary (po:Programme519215), episode "Emp. of the Sun", actor "Brian Cox": Sun? Brian Cox?
(Nunes et al., ESWC 2013)
Entity annotations: the example resources (Online Lecture, Lecture Slideset, Video Documentary) are linked to DBpedia entities and categories:
• db:Pluto (Dwarf Planet)
• db:Sun
• db:Astronomical Objects
• db:Astronomy
Data linking:
• Computation of connectivity scores between resources/entities
• Method: combination of
  • (i) a semantic (graph-based) connectivity score (SCS) with
  • (ii) a Web co-occurrence-based measure (CBM), similar to the Normalised Google Distance (NGD)
• For (i): adaptation of the Katz index from social network analysis (SNA) to (linked) data graphs, considering the number and lengths of paths over transversal properties
(Nunes et al., ESWC 2013)
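The two ingredients named above can be illustrated on a toy graph. This is a sketch under stated assumptions, not the scoring used in the cited paper: `katz_score` is a truncated Katz-style sum of beta**length over simple paths, `ngd` is the standard Normalised Google Distance used as a stand-in for CBM, and the graph edges, the damping factor and the path length limit are invented for illustration. How SCS and CBM are combined is not stated on the slide and is not modelled here.

```python
import math


def katz_score(graph, source, target, beta=0.5, max_len=4):
    """Truncated Katz-style score: sum of beta**l over all simple paths
    of length l <= max_len between source and target."""
    score = 0.0
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        length = len(path)  # edges used once one more hop is taken
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if nxt == target:
                score += beta ** length
            elif length < max_len:
                stack.append((nxt, path + (nxt,)))
    return score


def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from hit counts fx, fy, the
    co-occurrence count fxy and the index size n (lower = more related)."""
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy undirected graph over entities from the slides; the edges are assumed.
graph = {
    "db:Pluto": ["db:Astronomical_Objects", "db:Astronomy"],
    "db:Sun": ["db:Astronomical_Objects", "db:Astronomy"],
    "db:Astronomical_Objects": ["db:Pluto", "db:Sun", "db:Astronomy"],
    "db:Astronomy": ["db:Pluto", "db:Sun", "db:Astronomical_Objects"],
}
scs_like = katz_score(graph, "db:Pluto", "db:Sun")  # two 2-hop, two 3-hop paths
cbm_like = ngd(fx=100, fy=1000, fxy=10, n=10**6)    # hit counts are made up
print(scs_like, cbm_like)
```

Shorter and more numerous paths raise the graph-based score, while the co-occurrence measure depends only on how often the two labels appear together on the Web.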
Example scores from the slide diagram (at db:Sun): SCS = 0.32, CBM = 0.24
Dataset categorisation: computation of normalised (DBpedia) category relevance scores for datasets
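As a rough illustration of normalised category relevance scores for a dataset, the sketch below counts how many annotated resources carry each category and divides by the total, so that the scores sum to 1. The slides do not specify the normalisation, so this particular scheme, the function name `category_relevance` and the sample annotations are assumptions.

```python
from collections import Counter


def category_relevance(resource_categories):
    """Map each category to its share of all (resource, category) pairs.

    `resource_categories` holds one list of category labels per resource;
    a category is counted at most once per resource.
    """
    counts = Counter(
        cat for cats in resource_categories for cat in set(cats)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


# Assumed annotations for three resources of one dataset.
annotations = [
    ["db:Astronomy", "db:Astronomical Objects"],
    ["db:Astronomy"],
    ["db:Astronomy", "db:Sun"],
]
scores = category_relevance(annotations)
print(scores)
```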
Evaluation:
• Evaluation based on USA Today news items (80,000 entity pairs)
• Manually created gold standard (1,000 entity pairs)
• Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: "relatedness"; ESA: "similarity"
• [Figure: Precision/Recall/F1 for SCS, CBM, ESA]
(Nunes et al., ESWC 2013)
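For reference, precision, recall and F1 over entity pairs can be computed as in the sketch below. The gold and predicted pairs are invented for illustration and are not results from the evaluation.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Return (precision, recall, F1) for predicted pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and system output (pairs of related entities).
gold = {
    ("db:Pluto", "db:Sun"),
    ("db:Pluto", "db:Astronomy"),
    ("db:Sun", "db:Astronomy"),
}
predicted = {
    ("db:Pluto", "db:Sun"),
    ("db:Pluto", "db:Astronomy"),
    ("db:Sun", "db:Brian_Cox"),
    ("db:Pluto", "db:Brian_Cox"),
}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```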
Dataset classification: expanded dataset catalog & graph
• Enhanced dataset descriptions on the DataHub
• Dataset RDF graph: correlations based on semantic annotations (categories)
• http://linkedup-project.eu
• http://data.linkededucation.org/linkedup/catalog/
(D’Aquin et al., WebSci 2013)
Thank you!
http://purl.org/dietze
