DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia

Determining the semantic relatedness (i.e., the strength of a relation) of two resources in DBpedia (or other Linked Data sources) is a problem addressed by quite a few approaches in the recent past. However, there are no large-scale benchmark datasets for comparing such approaches, so it is an open question which approaches work better than others. Furthermore, large-scale datasets for training machine-learning-based approaches are not available. DBpediaNYD is a large-scale synthetic silver standard benchmark dataset which contains symmetric and asymmetric similarity values, obtained using a web search engine.

  1. DBpediaNYD – A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia – 10/22/13 – Heiko Paulheim
  2. Motivation
     • There are quite a few approaches to entity ranking / statement weighting on Linked Data – and DBpedia in particular
     • Examples:
       – Franz et al. (2009) – Tensor Decomposition
       – Meij et al. (2009) – Machine Learning
       – Mirizzi et al. (2010) – Web Search Engines
       – Mulay and Kumar (2011) – Machine Learning
       – Hees et al. (2012) – Crowd Sourcing
       – Nunes et al. (2012) – Social Network Analysis
  3. Motivation
     • However,
       – none of those have been competitively evaluated
       – none of those have been evaluated at large scale
     • Evaluation with
       – small private data sets
       – user studies
     • Approaches using Machine Learning
       – require training data
       – expensive to obtain
  4. The Dataset
     • Large-scale dataset (several thousand instances)
       – statements with strengths
     • Strength value: Normalized Google Distance
       NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log M − min(log f(x), log f(y)))
     • f(x): number of search results containing x
     • f(x, y): number of search results containing both x and y
     • M: number of pages in the search engine index
     • NGD has been shown to correlate with human-perceived association strength
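The Normalized Google Distance (Cilibrasi and Vitányi) can be computed directly from the hit counts f(x), f(y), f(x, y) and the index size M defined on this slide; a minimal sketch in Python:

```python
import math

def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Normalized Google Distance from search-engine hit counts.

    fx, fy: hits for each term alone; fxy: hits for both terms
    together; m: number of pages in the search engine index.
    """
    lx, ly, lxy, lm = math.log(fx), math.log(fy), math.log(fxy), math.log(m)
    return (max(lx, ly) - lxy) / (lm - min(lx, ly))

# Terms that always co-occur have distance 0; the rarer the
# co-occurrence relative to the individual counts, the larger
# the distance.
print(ngd(1000, 1000, 1000, 10**9))  # → 0.0
```

The hit counts here are toy values; in the dataset they come from actual search engine queries.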
  5. The Dataset
     • NGD is a symmetric value
       – the NYD dataset also contains asymmetric values
     • Asymmetric Normalized Google Distance
     • f(x): number of search results containing x
     • f(x, y): number of search results containing both x and y
     • M: number of pages in the search engine index
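The asymmetric formula on this slide was an image and did not survive extraction, so the exact definition is not recoverable here. One hypothetical directed variant, consistent with the f(x), f(x, y), and M definitions above, could look like:

```python
import math

def angd(fx: int, fxy: int, m: int) -> float:
    """Hypothetical directed variant of NGD (an assumption; not
    necessarily the exact formula used in DBpediaNYD).

    Measures how strongly pages about x also mention y: if nearly
    every page containing x also contains y, the distance is near 0.
    """
    return (math.log(fx) - math.log(fxy)) / (math.log(m) - math.log(fx))
```

A sanity check against the talk's John Lennon / Yoko Ono example: the direction whose pages almost always mention the other term (Yoko Ono → John Lennon) would get the smaller distance under such a definition.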
  6. Constructing the Dataset
     • We sampled 10,000 statements
       – with DBpedia resources as subject and object (e.g., no type statements, no literals)
       – with a dbpedia or dbpprop predicate
     • ...and computed symmetric/asymmetric NGD
       – using the labels as search strings
       – using Yahoo BOSS
  7. The Dataset
     • Random sample of 10,000 statements
       – i.e., 30,000 search engine calls (80¢/1,000 → 24 USD)
     • 3,058 pairs of resources had to be discarded
       – f(x) < f(x, y) or f(y) < f(x, y)
       – search engines sometimes don't count properly :-(
     • Result:
       – 6,942 weighted statements (symmetric)
       – 13,884 weighted statements (asymmetric)
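The consistency check described on this slide — a joint count can never legitimately exceed an individual count — is straightforward to sketch:

```python
def consistent(fx: int, fy: int, fxy: int) -> bool:
    """Keep a pair only if the joint hit count does not exceed either
    individual hit count; an inflated joint count indicates unreliable
    search-engine statistics, so the pair is discarded."""
    return fxy <= fx and fxy <= fy

# Toy (f(x), f(y), f(x, y)) triples; the second violates f(x) >= f(x, y).
pairs = [(500, 300, 120), (80, 300, 120)]
kept = [p for p in pairs if consistent(*p)]
print(kept)  # → [(500, 300, 120)]
```

In the actual dataset construction this filter removed 3,058 of the 10,000 sampled pairs.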
  8. The Dataset
     • Example: dbpedia:John_Lennon and dbpedia:Yoko_Ono
     • Distances:
       – symmetric: 0.18
       – John Lennon → Yoko Ono: 0.18
       – Yoko Ono → John Lennon: 0.03
     • Explanation:
       – Yoko Ono is famous for being John Lennon's wife
         • and most often mentioned in that context
       – John Lennon is more famous for being a member of the Beatles
  9. Example: the DBpedia FindRelated Service
     • We trained two regression SVMs (LibSVM) based on DBpediaNYD
       – one for symmetric, one for asymmetric values
       – the service allows for finding the most related among the linked resources
     • Example results: http://wiki.dbpedia.org/FindRelated
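The slide mentions training regression SVMs with LibSVM. scikit-learn's SVR is backed by LibSVM, so a sketch of that setup, under the assumption that each statement has already been turned into a numeric feature vector (the features and values below are toy placeholders, not from the actual dataset), might look like:

```python
from sklearn.svm import SVR  # SVR wraps the LibSVM implementation

# Hypothetical feature vectors for statements, paired with
# relatedness targets in the style of DBpediaNYD's NGD values.
X = [[0.1, 3.0], [0.8, 1.0], [0.4, 2.0], [0.9, 0.5]]
y = [0.18, 0.70, 0.35, 0.85]

model = SVR(kernel="rbf")
model.fit(X, y)

# Predict a relatedness score for an unseen statement's features.
prediction = model.predict([[0.5, 1.8]])
```

In the actual service, one such model was trained for symmetric and one for asymmetric values.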
  10. Conclusion and Outlook
     • DBpediaNYD allows for large-scale evaluation
       – rather a silver standard
       – does not replace manually created gold standards
     • Future work
       – validate DBpediaNYD with users
       – compare search engines
  11. Something Completely Different
     • Challenges enumerated in the workshop intro this morning
       – "Logical inference on noisy data"
     • Talk on "Type Inference on Noisy RDF Data"
       – was actually applied for DBpedia 3.9
       – Friday, 3:15, Bayside 204A
  12. DBpediaNYD – A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia – Heiko Paulheim
