From Linked Data to Tightly Integrated Data

Invited Talk at the 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing. Reykjavik, Iceland, 27th May 2014

The ideas behind the Web of Linked Data have great allure. Apart from the prospect of large amounts of freely available data, we are also promised nearly effortless interoperability. Common data formats and protocols have indeed made it easier than ever to obtain and work with information from different sources simultaneously, opening up new opportunities in linguistics, library science, and many other areas.
In this talk, however, I argue that the true potential of Linked Data can only be appreciated when extensive cross-linkage and integration engender an even higher degree of interconnectedness. This can take the form of shared identifiers, e.g. those based on Wikipedia and WordNet, which can be used to describe numerous forms of linguistic and commonsense knowledge. An alternative is to rely on sameAs and similarity links, which can automatically be discovered using scalable approaches like the LINDA algorithm but need to be interpreted with great care, as we have observed in experimental studies. A closer level of linkage is achieved when resources are also connected at the taxonomic level, as exemplified by the MENTA approach to taxonomic data integration. Such integration means that one can buy into ecosystems already carrying a range of valuable pre-existing assets. Even more tightly integrated resources like Lexvo.org combine triples from multiple sources into unified, coherent knowledge bases. Finally, I also comment on how to address some remaining challenges that are still impeding a more widespread adoption of Linked Data on the Web. In the long run, I believe that such steps will lead us to significantly more tightly integrated Linked Data.
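
The caution about sameAs links can be made concrete with a small sketch. Identity is transitive, so a chain of individually plausible links can end up equating two resources that one dataset deliberately keeps apart. The Python below is only an illustration under stated assumptions: it assumes that a dataset never uses two different identifiers for the same entity (ignoring exceptions such as redirects), the identifiers and links are invented for the example, and the function names are hypothetical. It merely detects suspicious clusters; it does not reproduce the repair algorithm discussed in the talk.

```python
from collections import defaultdict


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def same_as_clusters(links):
    """Group identifiers into the equivalence classes implied by sameAs links."""
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    clusters = defaultdict(set)
    for uri in parent:
        clusters[find(parent, uri)].add(uri)
    return list(clusters.values())


def conflicts(cluster):
    """Return the datasets that contribute more than one identifier to a cluster.

    Assumption: the part before the first ':' names the dataset.
    """
    by_dataset = defaultdict(set)
    for uri in cluster:
        dataset = uri.split(":", 1)[0]
        by_dataset[dataset].add(uri)
    return {d: sorted(uris) for d, uris in by_dataset.items() if len(uris) > 1}


# Invented example links; each one might look plausible in isolation.
links = [
    ("dbpedia:Paul", "freebase:Paul"),
    ("freebase:Paul", "musicbrainz:Paulie"),
    ("musicbrainz:Paulie", "dblp:Paula"),
    ("dblp:Paula", "dbpedia:Paula"),
]

for cluster in same_as_clusters(links):
    print(conflicts(cluster))
```

In this toy input, the chain of links places two distinct identifiers from the same dataset in one cluster, which signals that at least one link along the path expresses near-identity rather than strict identity.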

  1. 1. From Linked Data to Tightly Integrated Data May 2014 Gerard de Melo Tsinghua University, Beijing
  2. 2. 25 Years of the World Wide Web: 1989−2014 Tim Berners-Lee http://geekcom.wordpress.com/2009/03/19/
  3. 3. 25 Years of the World Wide Web: 1989−2014 Tim Berners-Lee http://geekcom.wordpress.com/2009/03/19/ Documents for human viewing
  4. 4. From Text to Structured Data October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… IE: NAME / TITLE / ORGANIZATION: Bill Gates / CEO / Microsoft; Bill Veghte / VP / Microsoft; Richard Stallman / founder / Free Soft.. Source: Marko Grobelnik, Dunja Mladenic. KDD 2007.
  5. 5. The Semantic Web Tim Berners-Lee http://geekcom.wordpress.com/2009/03/19/ colleague born in Frankfurt described by created by Publish data in the right form right from the start created by
  6. 6. The Semantic Web Assign URIs not just to Documents, also to People, etc. http://purl.org/dc/elements/1.1/creator http://dblp.l3s.de/d2r/page/publications/conf/cikm/MeloW09 http://www.demelo.org/gdm/#GDM Assign URIs to Predicates (Edge Types) created by
  7. 7. Challenge: Simplify Publishing
  8. 8. Challenge: Simplify Publishing http://www.gauson.com/blog/2007/12/09/minimal-template-for-blogspot/
  9. 9. Challenge: Simplify Publishing Freebase: Better UI but not universal
  10. 10. Big Knowledge Graphs
  11. 11. Big Knowledge Graphs YAGO2. Hoffart et al. WWW 2011.
  12. 12. Lexical Knowledge Bases
  13. 13. Etymological Wordnet LREC 2014 Poster Session P17 16:45-18:05 Also Christian Chiarcos today
  14. 14. Lexical Intensity Orderings okay < good < great < superb (weak to strong) de Melo & Bansal, Transactions of the ACL, 2013.
  15. 15. Metaphors: ICSI MetaNet Project
  16. 16. WebChild: Common-Sense. Common-Sense Relations, Properties, Comparisons. Tandon et al. WSDM 2014. Tandon et al. AAAI 2014. Tandon et al. AAAI 2011.
  17. 17. Linked Data in Use Input: Keywords, the World's Data Output: Address User's Needs
  18. 18. Linked Data In Use
  19. 19. Linked Data In Use used in IBM's Jeopardy!-winning Watson system
  20. 20. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  21. 21. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  22. 22. Really Linked Data Just converting to RDF is trivial
  23. 23. Really Linked Data use entities instead of literals where possible Book 23 author "Franz Kafka"
  24. 24. Really Linked Data use entities instead of literals where possible Book 23 author Author 14; Author 14 name "Franz Kafka"; Author 14 born in Prague
  25. 25. Really Linked Data use entities instead of literals where possible Performance 1 language "en"; Performance 2 language "English"; Performance 3 language "engl."
  26. 26. Really Linked Data use entities instead of literals where possible Performance 1 language English; Performance 2 language English; Performance 3 language English (http://lexvo.org/id/iso639-3/eng)
  27. 27. Vocabulary / Ontology Re-Use http://lov.okfn.org/
  28. 28. Vocabulary / Ontology Re-Use
  29. 29. Vocabulary / Ontology Re-Use
  30. 30. Linked Data Cloud
  31. 31. Linked Data Cloud
  32. 32. Identifiers and Cross-Linkage Arguably more important than RDF as a format Example: Google Knowledge Graph Buy into rich existing eco-systems
  33. 33. Focal Point: WordNet UWN (CIKM 2009): over 1,000,000 words in over 100 languages
  34. 34. UWN/MENTA: Universal WordNet
  35. 35. Focal Point: Lexvo.org Lexvo.org Cyrillic (Script) Ukraine owl:sameAs Ukraine GeoNames Ukrainian
  36. 36. Focal Point: Lexvo.org Lexvo.org Cyrillic (Script) Ukraine Ukrainian My Resource Ukrainian Lexvo.org API Identifiers .getLanguageURIforISO639P1("uk")
  37. 37. Focal Point: Lexvo.org "car"@en l:means sumo:Automobile lexvo:term/eng/car l:means sumo:Automobile Lexvo.org API Identifiers RDF .getTermURI("car", "eng")
  38. 38. Focal Point: Lexvo.org
  39. 39. Focal Point: Lexvo.org
  40. 40. Focal Point: Lexvo.org
  41. 41. Focal Point: Lexvo.org Semantic Web Journal 2014
  42. 42. Focal Point: Lexvo.org Lexvo.org Roget's Thesaurus WordNet Evocation Links Etymological WordNet PropBank lexicon NomBank lexicon MPQA Subjectivity Lexicon AFINN Affective Lexicon CMU Pronunciation Dictionary
  43. 43. Linked Entities Source: Gerhard Weikum. For a few Triples more.
  44. 44. Linked Entities
  45. 45. LINDA: Creating Links
  46. 46. LINDA: Creating Links LINDA: Böhm et al. CIKM 2012
  47. 47. LINDA: Creating Links LINDA: Böhm et al. CIKM 2012
  48. 48. LINDA: Creating Links LINDA: Böhm et al. CIKM 2012
  49. 49. LINDA: Creating Links LINDA: Böhm et al. CIKM 2012 Scale to Billion Triples Challenge Dataset despite dependencies
  50. 50. SameAs Links Lexvo.org Ukraine owl:sameAs GeoNames Ukraine Leibnizian Identity: For all x: x=x. For all x, y, p: x=y => p(x)=p(y)
  51. 51. Identity vs. Near-Identity Official Standard & Leibniz Automatic linkers & sameas.org Einstein owl:sameAs Einstein's Miracle Year
  52. 52. Merging Lexical Resources ACL 2010 AAAI 2013
  53. 53. Merging Lexical Resources ACL 2010 AAAI 2013
  54. 54. Identity Constraints Idea: Exploit Dataset-specific Unique Names Assumptions dbpedia: Paula dbpedia: Paul dbpedia: Paulie (redirect) musicbrainz: Paulie dblp: Paula freebase: Paul
  55. 55. Identity Constraints Idea: Exploit Dataset-specific Unique Names Assumptions dbpedia: Paula dbpedia: Paul musicbrainz: Paulie dblp: Paula freebase: Paul dbpedia: Paulie (redirect)
  56. 56. Identity Constraints dbpedia: Paula freebase: Paul dblp: Paula musicbrainz: Paulie dbpedia: Paul dbpedia: Paulie (redirect) Use set-based formalism to account for exceptions + to avoid quadratic number of pairwise constraints
  57. 57. Identity Constraints Add edge weights (2, 2, 1, 1, 1, 1) dbpedia: Paul freebase: Paul dblp: Paula musicbrainz: Paulie dbpedia: Paulie (redirect) dbpedia: Paula Goal: Consistency, minimizing weighted edge deletions
  58. 58. Algorithm See Paper for details, incl. relationship to Hungarian Algorithm and Graph Cuts Capture separation between nodes, which requires edge deletions along all paths
  59. 59. Algorithm Leighton & Rao style Region Growing dbpedia: Paul dbpedia: Paulie (redirect) musicbrainz: Paulie dbpedia: Paula dblp: Paula freebase: Paul Edge weights: 2, 2, 1, 1, 1, 1
  60. 60. Algorithm Leighton & Rao style Region Growing dbpedia: Paul dbpedia: Paulie (redirect) musicbrainz: Paulie dbpedia: Paula dblp: Paula freebase: Paul Edge weights: 2, 2, 1, 1, 1, 1
  61. 61. Experiments BTC: Large Linked Data Web crawl, 20GB gzipped sameas.org: Most well-known collection of sameAs links, aggregated from various Linked Data sources
  62. 62. Identity Constraints
  63. 63. Experiments >500,000 node pairs, but algorithm removes only 280,000 edges
  64. 64. Identity Links Must distinguish identity from near-identity Can automatically identify 500,000 inconsistent URI pairs Fix using LP Graph Algorithm Use more specific properties! lvont:strictlySameAs (Lexvo.org) skos:closeMatch etc.
  65. 65. Questions? Image: Question Answering over Linked Data Workshop
  66. 66. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  67. 67. Taxonomic Links a user wants a list of "Art Schools in Europe"
  68. 68. Multilingual Taxonomies a Swedish user wants a list of "Konstskolor i Europa"
  69. 69. MENTA 200+ Wikipedia editions WordNet Etc.
  70. 70. MENTA Predict Individual Identity Links: WordNet-Wikipedia, Article-Redirect, Article-Category, etc.
  71. 71. MENTA Predict Individual Taxonomic Links: Article → Category, Category → WordNet
  72. 72. MENTA
  73. 73. Taxonomic Links: MENTA
  74. 74. Taxonomic Links: MENTA Use Identity Constraint Algorithm to form equivalence classes Markov Chain Random Walk with Restarts to Rank Parents
  75. 75. Taxonomic Links: MENTA
  76. 76. UWN/MENTA CIKM 2010 Best Paper Award
  77. 77. MENTA: Multilingual Entity Taxonomy UWN/MENTA (de Melo & Weikum 2010) ● multilingual extension of WordNet, with 800,000 words in 250 languages ● 4.8 million instances/classes from multilingual Wikipedia editions
  78. 78. UWN/MENTA multilingual extension of WordNet for word senses and taxonomical information over 200 languages
  79. 79. Questions? Image: Question Answering over Linked Data Workshop
  80. 80. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  81. 81. Challenge: Locked Away Data Hard to run advanced algorithms over a SPARQL interface Many sites don't provide downloads.
  82. 82. Challenge: Lost Data http://sparqles.okfn.org/ Servers offline Poor archiving Dumps need to be archived and integrated.
  83. 83. Challenge: Updates Need to be able to update when data changes Need algorithmic solutions, not one-time process. YAGO2s: Biega et al. 2013
  84. 84. Requirement: Integration Algorithm Pipelines Input: Various Data Output: Tightly Integrated Data
  85. 85. Lexvo.org Semantic Web Journal 2014
  86. 86. Lexvo.org
  87. 87. Lexvo.org
  88. 88. Lexvo.org
  89. 89. Lexvo.org Semantic Web Journal 2014
  90. 90. Knowledge Graphs Most large-scale knowledge bases have ground facts only bornIn(Einstein,Ulm) acquired(Microsoft,Powerset) But language is much more expressive ● All humans are mortal. ● At least three but not more than 10 people know this secret. ● Three years ago, most people believed that Microsoft would buy Yahoo within months.
  91. 91. Challenge: Time Temporal scope missing Source: Gerhard Weikum. For a few Triples more.
  92. 92. OWL, RDFS, Description Logics WebProtégé http://protege.stanford.edu/ Limit expressivity to get decidability. Focus on class hierarchies and property axioms. Cannot create new rules e.g. to model "grandparent", "uncle", "legal adult"!
  93. 93. Reasoning Humans cannot act before being born (or, actually, before being conceived) (=> (and (human ?HUMAN) (birthdate ?HUMAN ?T) (agent ?PROCESS ?HUMAN)) (beforeOrEqual (daysBefore (BeginFn ?T) 365) (BeginFn (WhenFn ?PROCESS))))
  94. 94. Reasoning: SPASS-XDB
  95. 95. Search Interfaces "Which companies were created during the last century in Silicon Valley?" YAGO2: WWW 2011 Best Demo Award
  96. 96. Common-Sense Inference "I found the following restaurant near your current location: La Dolce Vita Pizza. 2318 Columbus Ave." "I'd rather have something healthier" Tandon et al. AAAI 2014
  97. 97. Conclusion Really Linked Data ► Shared Identifiers ► Proper Interlinking Integrated Data ► Taxonomical Integration Tightly Integrated Data ► Processing Pipelines ► Towards Common-Sense Inference www.demelo.org gdm@demelo.org
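
The "Shared Identifiers" point can be illustrated with a small sketch of replacing free-text language literals with one shared URI. This is only an illustration under stated assumptions: the URI pattern follows the http://lexvo.org/id/iso639-3/eng example from the slides, while the lookup table, the plain `language` predicate name, and the function names are hypothetical and do not represent the Lexvo.org API.

```python
LEXVO_ISO639_3 = "http://lexvo.org/id/iso639-3/"

# Hypothetical lookup table from literal spellings to ISO 639-3 codes.
# A real pipeline would need a far more complete mapping.
LITERAL_TO_ISO639_3 = {
    "en": "eng",
    "english": "eng",
    "engl.": "eng",
}


def language_uri(literal):
    """Return a Lexvo-style language URI for a literal, or None if unknown."""
    code = LITERAL_TO_ISO639_3.get(literal.strip().lower())
    return LEXVO_ISO639_3 + code if code else None


def link_languages(triples):
    """Replace language literals with URIs where a mapping is known.

    Triples are (subject, predicate, object) tuples; objects without a
    known mapping are kept unchanged rather than guessed.
    """
    linked = []
    for subject, predicate, obj in triples:
        uri = language_uri(obj) if predicate == "language" else None
        linked.append((subject, predicate, uri if uri else obj))
    return linked


triples = [
    ("Performance 1", "language", "en"),
    ("Performance 2", "language", "English"),
    ("Performance 3", "language", "engl."),
]

for triple in link_languages(triples):
    print(triple)
```

Once all three records point to the same identifier, a query for performances in English no longer has to enumerate every spelling of the language name, and the records can be joined with any other dataset that uses the same URI.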
