
From Linked Data to Tightly Integrated Data

Invited Talk at the 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing. Reykjavik, Iceland, 27th May 2014

The ideas behind the Web of Linked Data have great allure. Apart from the prospect of large amounts of freely available data, we are also promised nearly effortless interoperability. Common data formats and protocols have indeed made it easier than ever to obtain and work with information from different sources simultaneously, opening up new opportunities in linguistics, library science, and many other areas.
In this talk, however, I argue that the true potential of Linked Data can only be appreciated when extensive cross-linkage and integration engender an even higher degree of interconnectedness. This can take the form of shared identifiers, e.g. those based on Wikipedia and WordNet, which can be used to describe numerous forms of linguistic and commonsense knowledge. An alternative is to rely on sameAs and similarity links, which can be discovered automatically using scalable approaches like the LINDA algorithm but need to be interpreted with great care, as we have observed in experimental studies. A closer level of linkage is achieved when resources are also connected at the taxonomic level, as exemplified by the MENTA approach to taxonomic data integration. Such integration means that one can buy into ecosystems already carrying a range of valuable pre-existing assets. Even more tightly integrated resources like Lexvo.org combine triples from multiple sources into unified, coherent knowledge bases. Finally, I also comment on how to address some remaining challenges that are still impeding a more widespread adoption of Linked Data on the Web. In the long run, I believe that such steps will lead us to significantly more tightly integrated Linked Data.
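
One way to make the caution about sameAs links concrete is to check them against a dataset-specific unique names assumption, the idea named on the Identity Constraints slides: two distinct URIs from the same dataset should not end up in the same sameAs equivalence class. The following is a minimal Python sketch under stated assumptions, not the algorithm from the talk or the cited papers. The `prefix:name` URI convention and all function names are hypothetical; the node names are taken from the slides, but the edges between them are invented for illustration.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's class (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def same_as_classes(links):
    """Group URIs into the equivalence classes implied by sameAs links."""
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    classes = defaultdict(set)
    for uri in parent:
        classes[find(parent, uri)].add(uri)
    return list(classes.values())


def dataset_of(uri):
    # Assumption: the prefix before the first colon names the dataset.
    return uri.split(":", 1)[0]


def unique_name_violations(links, redirects=frozenset()):
    """List pairs of distinct same-dataset URIs that share a sameAs class.

    URIs in `redirects` are treated as permitted exceptions and skipped.
    """
    violations = []
    for cls in same_as_classes(links):
        by_dataset = defaultdict(list)
        for uri in sorted(cls):
            if uri not in redirects:
                by_dataset[dataset_of(uri)].append(uri)
        for uris in by_dataset.values():
            for i in range(len(uris)):
                for j in range(i + 1, len(uris)):
                    violations.append((uris[i], uris[j]))
    return sorted(violations)


# Invented sameAs edges over the node names shown on the slides.
links = [
    ("dbpedia:Paul", "freebase:Paul"),
    ("freebase:Paul", "musicbrainz:Paulie"),
    ("musicbrainz:Paulie", "dbpedia:Paulie"),
    ("musicbrainz:Paulie", "dblp:Paula"),
    ("dblp:Paula", "dbpedia:Paula"),
]

violations = unique_name_violations(links, redirects={"dbpedia:Paulie"})
print(violations)
```

On this toy input the check flags `dbpedia:Paul` and `dbpedia:Paula` once the redirect is exempted. Deciding which links to delete in order to restore consistency is the harder optimization problem that the slides describe as minimizing weighted edge deletions; this sketch only detects the conflict.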


  1. From Linked Data to Tightly Integrated Data. May 2014. Gerard de Melo, Tsinghua University, Beijing
  2. 25 Years of the World Wide Web: 1989−2014. Tim Berners-Lee. http://geekcom.wordpress.com/2009/03/19/
  3. 25 Years of the World Wide Web: 1989−2014. Tim Berners-Lee. http://geekcom.wordpress.com/2009/03/19/ Documents for human viewing
  4. From Text to Structured Data. October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… IE: NAME | TITLE | ORGANIZATION: Bill Gates | CEO | Microsoft; Bill Veghte | VP | Microsoft; Richard Stallman | founder | Free Soft.. Source: Marko Grobelnik, Dunja Mladenic. KDD 2007.
  5. The Semantic Web. Tim Berners-Lee. http://geekcom.wordpress.com/2009/03/19/ colleague, born in Frankfurt, described by, created by. Publish data in the right form right from the start. created by
  6. The Semantic Web. Assign URIs not just to Documents, also to People, etc. http://purl.org/dc/elements/1.1/creator http://dblp.l3s.de/d2r/page/publications/conf/cikm/MeloW09 http://www.demelo.org/gdm/#GDM Assign URIs to Predicates (Edge Types). created by
  7. Challenge: Simplify Publishing
  8. Challenge: Simplify Publishing. http://www.gauson.com/blog/2007/12/09/minimal-template-for-blogspot/
  9. Challenge: Simplify Publishing. Freebase: Better UI but not universal
  10. Big Knowledge Graphs
  11. Big Knowledge Graphs. YAGO2. Hoffart et al. WWW 2011.
  12. Lexical Knowledge Bases
  13. Etymological Wordnet. LREC 2014 Poster Session P17, 16:45-18:05. Also Christian Chiarcos today
  14. Lexical Intensity Orderings. okay < good < great < superb (weak to strong). de Melo & Bansal. Transactions of the ACL, 2013.
  15. Metaphors: ICSI MetaNet Project
  16. WebChild: Common-Sense. Common-Sense Relations, Properties, Comparisons. Tandon et al. WSDM 2014. Tandon et al. AAAI 2014. Tandon et al. AAAI 2011.
  17. Linked Data in Use. Input: Keywords, the World's Data. Output: Address User's Needs
  18. Linked Data In Use
  19. Linked Data In Use. used in IBM's Jeopardy!-winning Watson system
  20. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  21. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  22. Really Linked Data. Just converting to RDF is trivial
  23. Really Linked Data: use entities instead of literals where possible. Book 23 author “Franz Kafka”
  24. Really Linked Data: use entities instead of literals where possible. Book 23 author Author 14; Author 14 name “Franz Kafka”; Author 14 born in Prague
  25. Really Linked Data: use entities instead of literals where possible. Performance 1 language “en”; Performance 2 language “English”; Performance 3 language “engl.”
  26. Really Linked Data: use entities instead of literals where possible. Performance 1 language English; Performance 2 language English; Performance 3 language English. http://lexvo.org/id/iso639-3/eng
  27. Vocabulary / Ontology Re-Use. http://lov.okfn.org/
  28. Vocabulary / Ontology Re-Use
  29. Vocabulary / Ontology Re-Use
  30. Linked Data Cloud
  31. Linked Data Cloud
  32. Identifiers and Cross-Linkage. Arguably more important than RDF as a format. Example: Google Knowledge Graph. Buy into rich existing eco-systems
  33. Focal Point: WordNet. UWN (CIKM 2009): over 1,000,000 words in over 100 languages
  34. UWN/MENTA: Universal WordNet
  35. Focal Point: Lexvo.org. Lexvo.org Cyrillic (Script) Ukraine owl:sameAs Ukraine GeoNames Ukrainian
  36. Focal Point: Lexvo.org. Lexvo.org Cyrillic (Script) Ukraine Ukrainian My Resource Ukrainian Lexvo.org API Identifiers .getLanguageURIforISO639P1("uk")
  37. Focal Point: Lexvo.org. “car”@en l:means sumo:Automobile. lexvo:term/eng/car l:means sumo:Automobile. Lexvo.org API Identifiers RDF .getTermURI("car", "eng")
  38. Focal Point: Lexvo.org
  39. Focal Point: Lexvo.org
  40. Focal Point: Lexvo.org
  41. Focal Point: Lexvo.org. Semantic Web Journal 2014
  42. Focal Point: Lexvo.org. Lexvo.org Roget's Thesaurus WordNet Evocation Links Etymological WordNet PropBank lexicon NomBank lexicon MPQA Subjectivity Lexicon AFINN Affective Lexicon CMU Pronunciation Dictionary
  43. Linked Entities. Source: Gerhard Weikum. For a few Triples more.
  44. Linked Entities
  45. LINDA: Creating Links
  46. LINDA: Creating Links. LINDA: Böhm et al. CIKM 2012
  47. LINDA: Creating Links. LINDA: Böhm et al. CIKM 2012
  48. LINDA: Creating Links. LINDA: Böhm et al. CIKM 2012
  49. LINDA: Creating Links. LINDA: Böhm et al. CIKM 2012. Scale to Billion Triples Challenge Dataset despite dependencies
  50. SameAs Links. Lexvo.org Ukraine owl:sameAs Ukraine GeoNames. Leibnizian Identity: For all x: x=x. For all x, y, p: x=y => p(x)=p(y)
  51. Identity vs. Near-Identity. Official Standard & Leibniz. Automatic linkers & sameas.org. Einstein owl:sameAs Einstein's Miracle Year
  52. Merging Lexical Resources. ACL 2010. AAAI 2013
  53. Merging Lexical Resources. ACL 2010. AAAI 2013
  54. Identity Constraints. Idea: Exploit Dataset-specific Unique Names Assumptions. Graph nodes: dbpedia: Paula, dbpedia: Paul, dbpedia: Paulie (redirect), musicbrainz: Paulie, dblp: Paula, freebase: Paul
  55. Identity Constraints. Idea: Exploit Dataset-specific Unique Names Assumptions. Graph nodes: dbpedia: Paula, dbpedia: Paul, musicbrainz: Paulie, dblp: Paula, freebase: Paul, dbpedia: Paulie (redirect)
  56. Identity Constraints. Graph nodes: dbpedia: Paula, freebase: Paul, dblp: Paula, musicbrainz: Paulie, dbpedia: Paul, dbpedia: Paulie (redirect). Use set-based formalism to account for exceptions + to avoid quadratic number of pairwise constraints
  57. Identity Constraints. Add edge weights (2, 2, 1, 1, 1, 1). Graph nodes: dbpedia: Paul, freebase: Paul, dblp: Paula, musicbrainz: Paulie, dbpedia: Paulie (redirect), dbpedia: Paula. Goal: Consistency minimizing weighted edge deletions
  58. Algorithm. See Paper for details, incl. relationship to Hungarian Algorithm and Graph Cuts. Capture separation between nodes, which requires edge deletions along all paths
  59. Algorithm. Leighton & Rao style Region Growing. Graph nodes: dbpedia: Paul, dbpedia: Paulie (redirect), musicbrainz: Paulie, dbpedia: Paula, dblp: Paula, freebase: Paul. Edge weights: 2, 2, 1, 1, 1, 1
  60. Algorithm. Leighton & Rao style Region Growing. Graph nodes: dbpedia: Paul, dbpedia: Paulie (redirect), musicbrainz: Paulie, dbpedia: Paula, dblp: Paula, freebase: Paul. Edge weights: 2, 2, 1, 1, 1, 1
  61. Experiments. BTC: Large Linked Data Web crawl, 20GB gzipped. sameas.org: Most well-known collections of sameAs links, aggregated from various Linked Data sources
  62. Identity Constraints
  63. Experiments. >500,000 node pairs, but algorithm removes only 280,000 edges
  64. Identity Links. Must distinguish identity from near-identity. Can automatically identify 500,000 inconsistent URI pairs. Fix using LP Graph Algorithm. Use more specific properties! lvont:strictlySameAs (Lexvo.org), skos:closeMatch, etc.
  65. Questions? Image: Question Answering over Linked Data Workshop
  66. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  67. Taxonomic Links. a user wants a list of „Art Schools in Europe“
  68. Multilingual Taxonomies. a Swedish user wants a list of „Konstskolor i Europa“ (Swedish for “Art Schools in Europe”)
  69. MENTA. 200+ Wikipedia editions, WordNet, Etc.
  70. MENTA. Predict Individual Identity Links: WordNet-Wikipedia, Article-Redirect, Article-Category, etc.
  71. MENTA. Predict Individual Taxonomic Links: Article → Category, Category → WordNet
  72. MENTA
  73. Taxonomic Links: MENTA
  74. Taxonomic Links: MENTA. Use Identity Constraint Algorithm to form equivalence classes. Markov Chain Random Walk with Restarts to Rank Parents
  75. Taxonomic Links: MENTA
  76. UWN/MENTA. CIKM 2010 Best Paper Award
  77. MENTA: Multilingual Entity Taxonomy. UWN/MENTA (de Melo & Weikum 2010) ● multilingual extension of WordNet, with 800,000 words in 250 languages ● 4.8 million instances/classes from multilingual Wikipedia editions
  78. UWN/MENTA. multilingual extension of WordNet for word senses and taxonomical information over 200 languages
  79. Questions? Image: Question Answering over Linked Data Workshop
  80. The Plan: Linked Data, Really Linked Data, Integrated Data, Tightly Integrated Data
  81. Challenge: Locked Away Data. Hard to run advanced algorithms over a SPARQL interface. Many sites don't provide downloads.
  82. Challenge: Lost Data. http://sparqles.okfn.org/ Servers offline. Poor archiving. Dumps need to be archived and integrated.
  83. Challenge: Updates. Need to be able to update when data changes. Need algorithmic solutions, not one-time process. YAGO2s: Biega et al. 2013
  84. Requirement: Integration Algorithm Pipelines. Input: Various Data. Output: Tightly Integrated Data
  85. Lexvo.org. Semantic Web Journal 2014
  86. Lexvo.org
  87. Lexvo.org
  88. Lexvo.org
  89. Lexvo.org. Semantic Web Journal 2014
  90. Knowledge Graphs. Most large-scale knowledge bases have ground facts only: bornIn(Einstein,Ulm), acquired(Microsoft,Powerset). But language is much more expressive ● All humans are mortal. ● At least three but not more than 10 people know this secret. ● Three years ago, most people believed that Microsoft would buy Yahoo within months.
  91. Challenge: Time. Temporal scope missing. Source: Gerhard Weikum. For a few Triples more.
  92. OWL, RDFS, Description Logics. WebProtégé http://protege.stanford.edu/ Limit expressivity to get decidability. Focus on class hierarchies and property axioms. Cannot create new rules e.g. to model “grandparent”, “uncle”, “legal adult”!
  93. Reasoning. Humans cannot act before being born (or, actually, before being conceived) (=> (and (human ?HUMAN) (birthdate ?HUMAN ?T) (agent ?PROCESS ?HUMAN)) (beforeOrEqual (daysBefore (BeginFn ?T) 365) (BeginFn (WhenFn ?PROCESS))))
  94. Reasoning: SPASS-XDB
  95. Search Interfaces. “Which companies were created during the last century in Silicon Valley?” YAGO2: WWW 2011 Best Demo Award
  96. Common-Sense Inference. I found the following restaurant near your current location: La Dolce Vita Pizza. 2318 Columbus Ave. I'd rather have something healthier. Tandon et al. AAAI 2014
  97. Conclusion. Really Linked Data ► Shared Identifiers ► Proper Interlinking. Integrated Data ► Taxonomical Integration. Tightly Integrated Data ► Processing Pipelines ► Towards Common-Sense Inference. www.demelo.org gdm@demelo.org Gerard de Melo
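
The advice on slides 23 to 26, to use entities instead of literals where possible, can be sketched as a small normalization step. This is only an illustration under stated assumptions: triples are modeled as plain Python tuples rather than with an RDF library, and the lookup table and the function name `link_language` are hypothetical. Only the Lexvo.org URI and the three literal spellings come from the slides.

```python
# URI shown on slide 26 for the English language.
LEXVO_ENG = "http://lexvo.org/id/iso639-3/eng"

# Hypothetical lookup table; a real resource would cover far more
# spellings and languages than these three examples from slide 25.
LANGUAGE_URIS = {
    "en": LEXVO_ENG,
    "english": LEXVO_ENG,
    "engl.": LEXVO_ENG,
}


def link_language(triple):
    """Replace a known language literal with a shared entity URI.

    Triples with other predicates or unknown literals are returned unchanged.
    """
    subject, predicate, obj = triple
    if predicate == "language":
        uri = LANGUAGE_URIS.get(obj.strip().lower())
        if uri is not None:
            return (subject, predicate, uri)
    return triple


triples = [
    ("Performance 1", "language", "en"),
    ("Performance 2", "language", "English"),
    ("Performance 3", "language", "engl."),
]

linked = [link_language(t) for t in triples]
for t in linked:
    print(t)
```

After this step all three performances point at the same identifier, so a query for English-language performances no longer has to anticipate every spelling of the literal.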
