
Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Abstract:
It is difficult to find resources on the Semantic Web today, in particular if one wants to search for resources based on natural language keywords and across multiple datasets.
In this paper, we present LOTUS: Linked Open Text UnleaShed, a full-text lookup index over a huge Linked Open Data collection.
We detail LOTUS' approach, its implementation, and its coverage, and demonstrate the ease with which it allows the LOD cloud to be queried in different domain-specific scenarios.
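
As a rough illustration of what a full-text lookup over literals involves, the Python sketch below builds a tiny in-memory index and queries it in two ways that loosely mirror the phrase and terms query modes named in the slides. It is not LOTUS's implementation: the example URIs, the toy data, and the function names (`build_index`, `lookup_terms`, `lookup_phrase`) are assumptions made for illustration, and ranking by the number of matched terms stands in for the richer scoring the slides mention.

```python
from collections import defaultdict

# Hypothetical toy data: (subject URI, literal) pairs, not taken from LOTUS.
TRIPLES = [
    ("http://example.org/helr", "Harvard Environmental Law Review"),
    ("http://example.org/hlr", "Harvard Law Review"),
    ("http://example.org/elq", "Ecology Law Quarterly"),
]


def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def build_index(triples):
    """Map each token to the (subject, literal) pairs whose literal contains it."""
    index = defaultdict(set)
    for subject, literal in triples:
        for token in tokenize(literal):
            index[token].add((subject, literal))
    return index


def lookup_terms(index, query):
    """Terms-style lookup: rank entries by how many distinct query terms they match."""
    hits = defaultdict(int)
    for token in set(tokenize(query)):
        for entry in index.get(token, ()):
            hits[entry] += 1
    # Most matched terms first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda entry: (-hits[entry], entry))


def lookup_phrase(triples, phrase):
    """Phrase-style lookup: case-insensitive substring match on the whole literal."""
    needle = phrase.lower()
    return [(s, lit) for s, lit in triples if needle in lit.lower()]


INDEX = build_index(TRIPLES)
```

In this toy setting, `lookup_phrase(TRIPLES, "law review")` matches only literals containing that exact substring, while `lookup_terms(INDEX, "Harvard Environmental Law Review")` also returns partial matches, ranked below the full match.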

Published in: Science

  1. 1. LOTUS: Linked Open Text UnleaShed
  2. 2. Authors: Marieke van Erp, Stefan Schlobach, Wouter Beek, Filip Ilievski, Laurens Rietveld
  3. 3. Consuming LD: finding relevant LD resources based on natural text. Central for application areas: Information Retrieval, Named Entity Linking. “central indices (e.g. Sindice) have disappeared”
  4. 4. Let’s play a game ... Find Linked Open Data resources for: “HELR: The Harvard Environmental Law Review”
  5. 5. How do we find relevant resources on the Semantic Web today? 1) Dereference: literals are not dereferenceable by definition. 2) SPARQL endpoints: find resources only in an explicitly stated set of datasets; exact or substring/regex matching.
  6. 6. Sometimes we do this ...
  7. 7. “COLD means using SPARQL on centralized endpoints”
  8. 8. Summarising: findability is a problem on the SW today. We need: ● a single entry point to the Linked Open Data cloud ● to find resources based on approximate text matching
  9. 9. Towards the findability problem of the SW. We need: ● a single entry point to the Linked Open Data cloud (LOD Laundromat) ● to find resources based on approximate text matching (LOTUS)
  10. 10. #1 LOD Laundromat: infrastructure that washes other people’s dirty data and republishes it as RDF. Central entry point to the Linked Open Data cloud.
  11. 11. #1 LOD Laundromat: 657,885 documents, 38,606,408,433 statements ... can be simultaneously queried from the Laundromat Wardrobe
  12. 12. #2 LOTUS: full-text lookup index over LOD Laundromat. Finds resources based on associated natural text. Inspired by application areas: IR and NED.
  13. 13. LOTUS’ approach: Text2Literal mapping (and onwards to documents and resources) for described resources (with at least one associated literal) that contain natural text (numbers and dates are not findable) through a rich string approximation model (substring, phonetic, synonym matching, TF-IDF scoring, match granularity).
  14. 14. Implementation - index builder
  15. 15. Implementation - index builder
  16. 16. Index fields: subject; predicate; object string; user langtag; document ID
  17. 17. Implementation - Public Interface
  18. 18. Implementation - Public Interface
  19. 19. Implementation - Public Interface
  20. 20. Query modes. PHRASE: substring matching, e.g. phrase(“Harvard Environmental Law Review”). TERMS: lookup a set of terms, e.g. terms(“HELR. Harvard ELR Environmental Law Review”). *Optionally, supply a langtag: phrase(“Harvard Environmental Law Review”, “en”)
  21. 21. LOTUS v1.0 statistics: 5,319,790,836 natural text literals; 12,018,939,378 literals; 474.77 GB disk; 56 hours
  22. 22. Preliminary Evaluation: 191 local monuments, manually extracted from a Dutch tour guide; a list of 231 scientific journals from a Norwegian Social Sciences Data Services website
  23. 23. Preliminary Evaluation: text queries for which we find at least one resource (local monuments / scientific journals / overall %). Total queries: 191 / 231
  24. 24. Preliminary Evaluation: text queries for which we find at least one resource (local monuments / scientific journals / overall %). Total queries: 191 / 231; in DBpedia (via SPARQL): 53 / 77 / 30.8%
  25. 25. Preliminary Evaluation: text queries for which we find at least one resource (local monuments / scientific journals / overall %). Total queries: 191 / 231; in DBpedia (via SPARQL): 53 / 77 / 30.8%; in DBpedia (via LOTUS phrase): 165 / 182 / 82.2%
  26. 26. Preliminary Evaluation: text queries for which we find at least one resource (local monuments / scientific journals / overall %). Total queries: 191 / 231; in DBpedia (via SPARQL): 53 / 77 / 30.8%; in DBpedia (via LOTUS phrase): 165 / 182 / 82.2%; in LOD (via LOTUS phrase): 168 / 216 / 91.0%
  27. 27. Preliminary Evaluation: text queries for which we find at least one resource (local monuments / scientific journals / overall %). Total queries: 191 / 231; in DBpedia (via SPARQL): 53 / 77 / 30.8%; in DBpedia (via LOTUS phrase): 165 / 182 / 82.2%; in LOD (via LOTUS phrase): 168 / 216 / 91.0%; in LOD (via LOTUS terms): 188 / 231 / 99.3%
  28. 28. LOTUS v1.0: a start towards a natural text index over the LOD cloud; 5.3B indexed literals can be looked up; query modes for approximate matching; accessible through web frontend and API
  29. 29. Current work (LOTUS v1.1): add langtags through automatic language detection; extract knowledge base information from URIs; extract meaning of formatting convention from URIs; add conjunctive & fuzzy query modes
  30. 30. Future work: evaluation of precision ◎ task-specific (IR, NED); integration of structured and unstructured data; relevance and ranking
  31. 31. http://lotus.lodlaundromat.org
  32. 32. Thanks! We would love to hear your comments and suggestions!
  33. 33. http://lotus.lodlaundromat.org LOD Laundromat cleans and republishes LOD, making it reachable via a single access point. LOTUS finds LOD Laundromat resources based on natural text.
  34. 34. Contact: f.ilievski@vu.nl, w.g.j.beek@vu.nl, marieke.van.erp@vu.nl, laurens.rietveld@vu.nl, k.s.schlobach@vu.nl
  35. 35. Appendices
  36. 36. LOTUS vs Sindice (Sindice / LOTUS): Relate URIs and literals to documents / Relate URIs, literals and documents to each other; Accepts URIs which can be dereferenceable or have a SPARQL endpoint / Accepts any type of data; Partially incorrect datasets are excluded / Partially incorrect datasets are included; Relies on original URI availability / Original URI can be ‘down’; 30M URIs & 45M literals / 3,700M URIs & 5,320M literals
  37. 37. You will have a bad time finding these via SPARQL: “National Socialist German Workers' Party Foreign Organisation”; “The NSDAP/AO was the Foreign Organization of the National Socialist German Workers Party (NSDAP).”; “De 9 straatjes”; “Negen straatjes (Amsterdam), 9 straatjes”@nl; “Shopping guide: negen straatjes”@nl-NL
  38. 38. You will have a bad time finding these in SPARQL: "1375 W Lake Street"; "1501 W. Randolph St."; "29 North 7th Street"; "Fritz-Pregl-Str. 5"@en; "33-35 Stoke Newington Road"; "Trompsingel 27"; "226 Broadway, 2nd Floor"; "Shinbo Building, 402-22, B1 Seogyo-dong, Mapo-gu"
  39. 39. Preliminary Evaluation (recall): % of DBpedia resources in top 100 results (local monuments / scientific journals). LOTUS phrase: 70.48% / 24.83%; LOTUS terms: 67.19% / 22.33%
  40. 40. Preliminary Evaluation: measured recall of: ◎ NIL entities from CoNLL/AIDA ◎ Local monuments ◎ List of scientific journals
  41. 41. LOTUS v1.1. Index fields: object+predicate+subject string; user+auto langtag; document ID; predicate; subject. Query modes: term-based query; phrase-based query; conjunctive query; fuzzy query; plus language tag matching
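
The "Overall %" figures in the preliminary evaluation (slides 23 to 27) can be recomputed from the per-list counts given there. The short Python check below does this, assuming that the overall percentage is the number of queries with at least one result, summed over both lists, divided by the 422 queries in total. That formula is an inference from the slide numbers, not something the slides state, and the variable and function names are illustrative.

```python
# Query counts per list, as given on the evaluation slides.
TOTALS = {"monuments": 191, "journals": 231}

# Queries with at least one result: (local monuments, scientific journals).
FOUND = {
    "DBpedia (via SPARQL)": (53, 77),
    "DBpedia (via LOTUS phrase)": (165, 182),
    "LOD (via LOTUS phrase)": (168, 216),
    "LOD (via LOTUS terms)": (188, 231),
}


def overall_percent(found_monuments, found_journals):
    """Share of all queries (both lists pooled) with at least one result, in %."""
    total = sum(TOTALS.values())
    return round(100 * (found_monuments + found_journals) / total, 1)


OVERALL = {name: overall_percent(*counts) for name, counts in FOUND.items()}
```

Under that assumption the recomputed values agree with the percentages shown on the slides (30.8, 82.2, 91.0, and 99.3).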
