A Comparison of Supervised Learning Classifiers 
for Link Discovery 
Tommaso Soru and Axel-Cyrille Ngonga Ngomo 
Agile Knowledge Engineering and Semantic Web 
Department of Computer Science 
University of Leipzig 
Augustusplatz 10, 04109 Leipzig 
{tsoru,ngonga}@informatik.uni-leipzig.de 
http://aksw.org 
September 4, 2014
SEMANTiCS 2014 | The 10th International Conference on Semantic Systems 
Introduction/1 
The 4th Linked Data Web Principle. 
"Include links to other URIs, so that they can discover more 
things." – Tim Berners-Lee 
31B triples in 2011, 
of which only 3% link 
different datasets 
71B triples expected in 
2014 
T. Soru, A. Ngonga Ngomo | September 4, 2014 | A Comparison of Supervised Learning Classifiers for Link Discovery 
Introduction/2 
Link Discovery 
What? Discover new links among resources. 
How? Using supervised and unsupervised methods. 
Why? Links are important for data integration, question 
answering, knowledge extraction. 
We will focus on supervised machine-learning algorithms. 
Preliminaries 
Link Discovery. 
Given two datasets $S$ and $T$, the general aim of link discovery is to find the set 
of resource pairs $(s, t) \in S \times T$ such that $R(s, t)$ holds, where $R$ is a given 
relation such as owl:sameAs or dbp:near. 
Link Specification. 
A link specification is a rule composed of a complex similarity function $\mathrm{sim}$ and 
a threshold $\theta$ that defines which pairs $(s, t)$ should be linked together: 
$\mathrm{sim}(s, t) \geq \theta$ 
Main problems 
1 Naïve approaches demand quadratic time complexity. 
2 Efficient algorithms do not imply accurate link specifications. 
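The definitions above can be illustrated with a minimal sketch of the naïve approach, which evaluates the similarity for every pair in S × T and therefore needs |S| · |T| comparisons. The function names and the Jaccard-over-trigrams similarity are illustrative assumptions; the slides do not prescribe a concrete `sim` at this point.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def sim(s, t):
    """Illustrative similarity (an assumption): Jaccard overlap of trigram sets."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / len(a | b)


def naive_link_discovery(source, target, theta):
    """Keep every pair (s, t) with sim(s, t) >= theta.

    All |S| * |T| pairs are compared, which is the quadratic cost
    named under "Main problems".
    """
    return [(s, t) for s, t in product(source, target) if sim(s, t) >= theta]


S = ["Leipzig", "Berlin"]
T = ["Leipzig", "Berlin", "Dresden"]
links = naive_link_discovery(S, T, theta=0.9)
print(links)
```

Here six similarity computations are needed for two source and three target resources; the count grows with the product of the dataset sizes.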
Motivation 
We want to answer these questions. 
Q1: Which of the paradigms achieves the best F-measures? 
Q2: Which of the paradigms is most robust against noise? 
Q3: Which of the methods is the most time-efficient? 
Overview/1 
Evaluation pipeline 
Alignment between properties is carried out manually. 
Perfect mapping (i.e., labels): $(s, t)$ is a positive example iff $R(s, t)$ holds. 
Overview/2 
Assumptions 
The complex similarity function sim compares property values. 
In case of 
  datatype properties: it uses text/numerical/date similarities. 
  object properties: it applies the similarities iteratively. 
Graph structure has not been considered as a feature per se. 
Cross-validation has been preferred over semi-supervised learning because it yields more accurate results. 
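The pipeline and assumptions above can be sketched as follows: each candidate pair becomes a vector with one similarity value per manually aligned property pair, and its label is positive iff the pair belongs to the perfect mapping. The property names, the toy records and the use of `difflib.SequenceMatcher` as the string similarity are hypothetical stand-ins, not the authors' setup.

```python
from difflib import SequenceMatcher

# Hypothetical manual alignment: (source property, target property).
ALIGNMENT = [("name", "label"), ("city", "location")]


def text_similarity(a, b):
    """Stand-in string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def feature_vector(s, t):
    """One similarity value per aligned property pair."""
    return [text_similarity(s[p], t[q]) for p, q in ALIGNMENT]


def labelled_examples(source, target, perfect_mapping):
    """Yield (features, label); label is 1 iff (s, t) is in the perfect mapping."""
    for s_id, s in source.items():
        for t_id, t in target.items():
            yield feature_vector(s, t), int((s_id, t_id) in perfect_mapping)


# Toy data, invented for illustration only.
source = {"s1": {"name": "Anna Meyer", "city": "Leipzig"}}
target = {
    "t1": {"label": "Anna Meyer", "location": "Leipzig"},
    "t2": {"label": "Bob Smith", "location": "Graz"},
}
perfect_mapping = {("s1", "t1")}

examples = list(labelled_examples(source, target, perfect_mapping))
for features, label in examples:
    print(features, label)
```

The resulting (features, label) pairs are the kind of input a supervised classifier would be trained and cross-validated on.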
Evaluation Setup/1 
Similarities 
for string values: 
  Weighted trigram similarity, setting tf-idf scores as weights 
  Weighted edit distance, setting confusion matrices as weights 
  Cosine similarity 
for numerical values: 
  Logarithmic similarity 
for date values: 
  a day-based Date similarity 
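To give a feel for the kinds of measures listed above, here is a rough sketch of one similarity per value type. Only the cosine similarity follows a standard definition; the closed forms chosen for the logarithmic and day-based date similarities are assumptions made for illustration, since the slides do not give their formulas, and the weighted trigram and edit-distance variants are not covered.

```python
import math
from collections import Counter
from datetime import date


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def log_similarity(x, y):
    """Assumed form: 1 / (1 + |log(1 + x) - log(1 + y)|), for x, y >= 0."""
    return 1.0 / (1.0 + abs(math.log1p(x) - math.log1p(y)))


def date_similarity(d1, d2):
    """Assumed form: 1 / (1 + number of days between the two dates)."""
    return 1.0 / (1.0 + abs((d1 - d2).days))


print(cosine_similarity("link discovery", "link specification"))
print(log_similarity(250_000, 240_000))
print(date_similarity(date(2014, 9, 4), date(2014, 9, 5)))
```

All three functions return values in [0, 1], with 1 for identical inputs, so they can be used directly as features.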
Evaluation Setup/2 
Linear non-probabilistic classifiers 
  Linear SVM* 
  Polynomial SVM* 
  Linear SVM with Sequential Minimal Optimization 
  Linear Regression 
Probabilistic classifiers 
  Logistic Regression 
  Naïve Bayes 
  Random Tree 
  J48 
Neural networks 
  Multilayer Perceptron 
Rule-based classifiers 
  Decision Table 
We used classifiers from the Weka library, except (*) from LibSVM. 
Evaluation Setup/3 
Datasets 
D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark 
D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig 
D5-D6: datasets having a high level of noise 
#   dataset                domain         size
D1  OAEI-Persons1          personal data  250k
D2  OAEI-Persons2          personal data  240k
D3  OAEI-Restaurants       places         72k
D4  DBLP-ACM               bibliographic  6M
D5  Amazon-GoogleProducts  e-commerce     10M
D6  ABT-Buy                e-commerce     1M
Results/1 
F-measure 
Classifier             D1       D2       D3       D4      D5      D6
Linear SVM             99.40%   98.99%   97.75%   97.81%  27.06%  39.18%
Linear SMO             100.00%  98.73%   100.00%  92.58%  46.63%  31.39%
Polynomial-3 SVM       99.40%   93.76%   98.29%   97.67%  37.28%  31.69%
Multilayer Perceptron  99.50%   99.50%   100.00%  97.43%  35.58%  43.49%
Logistic Regression    99.90%   98.12%   96.67%   97.71%  40.64%  41.92%
Linear Regression      99.30%   96.92%   100.00%  96.36%  37.06%  36.84%
Naïve Bayes            97.75%   35.05%   95.19%   29.47%  2.92%   11.90%
Decision Table         97.98%   100.00%  100.00%  97.66%  42.44%  29.66%
Random Tree            97.45%   99.24%   89.89%   96.82%  39.38%  41.03%
J48                    99.50%   95.56%   98.29%   97.66%  44.28%  31.53%
State of the Art       100.00%  100.00%  100.00%  98.20%  62.10%  71.30%
F-measure calculated on the class of positive examples. 
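As a reminder of what "F-measure on the class of positive examples" means, the following sketch computes precision, recall and their harmonic mean from a set of predicted links and a reference mapping. The link identifiers are invented for the example; this is the standard F1 definition, not code from the evaluation itself.

```python
def f_measure(predicted, reference):
    """F1 on the positive class: harmonic mean of precision and recall."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy link sets, invented for illustration.
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}

score = f_measure(predicted, reference)
print(f"{score:.2%}")
```

With two correct links out of three predicted and four expected, precision is 2/3, recall is 1/2, and the F-measure is 4/7. Pairs correctly left unlinked do not enter the score, which is why a classifier can have high accuracy and still a low F-measure on noisy datasets.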
Results/2 
Computation runtimes 
Classifier             D1     D2     D3    D4      D5        D6
Linear SVM             7.16   6.93   2.67  63.94   484.29    75.44
Linear SMO             17.07  12.93  3.77  113.40  369.20    37.16
Polynomial-3 SVM       5.67   6.18   2.63  162.82  1,091.10  103.89
Multilayer Perceptron  15.13  16.10  3.40  96.96   376.26    41.68
Logistic Regression    16.11  14.91  4.61  110.12  275.94    38.48
Linear Regression      16.04  16.21  5.02  120.54  497.43    44.50
Naïve Bayes            17.34  17.09  4.39  105.31  375.91    43.79
Decision Table         16.68  16.44  3.78  90.99   389.35    48.87
Random Tree            12.02  11.16  2.24  53.67   347.36    34.11
J48                    21.31  15.96  6.99  131.57  98.27     38.46
All values in seconds. 
  • 56. Results/3: Considerations
        Some average trends can be identified, yet no algorithm significantly outperforms all the others.
        Multilayer Perceptrons performed best, both with and without the noisy datasets.
        Random Trees seem to be the fastest approach overall.
        The different approaches seem complementary in their behaviour.
        Naïve Bayes might fail because it treats all features as independent of each other.
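The average trend mentioned above can be checked against the F-measure table. The sketch below takes the unweighted mean over all six datasets for each classifier. The unweighted mean is an assumption made here for illustration; the slides do not say how the authors aggregated the scores or which datasets they treated as noisy. The "State of the Art" row is left out because it is not a single classifier.

```python
# F-measures in percent for D1..D6, copied from the F-measure table.
F_MEASURES = {
    "Linear SVM":            [99.40, 98.99, 97.75, 97.81, 27.06, 39.18],
    "Linear SMO":            [100.00, 98.73, 100.00, 92.58, 46.63, 31.39],
    "Polynomial-3 SVM":      [99.40, 93.76, 98.29, 97.67, 37.28, 31.69],
    "Multilayer Perceptron": [99.50, 99.50, 100.00, 97.43, 35.58, 43.49],
    "Logistic Regression":   [99.90, 98.12, 96.67, 97.71, 40.64, 41.92],
    "Linear Regression":     [99.30, 96.92, 100.00, 96.36, 37.06, 36.84],
    "Naive Bayes":           [97.75, 35.05, 95.19, 29.47, 2.92, 11.90],
    "Decision Table":        [97.98, 100.00, 100.00, 97.66, 42.44, 29.66],
    "Random Tree":           [97.45, 99.24, 89.89, 96.82, 39.38, 41.03],
    "J48":                   [99.50, 95.56, 98.29, 97.66, 44.28, 31.53],
}

# Unweighted mean over the six datasets (an assumption, see the text above).
mean_f = {name: sum(scores) / len(scores) for name, scores in F_MEASURES.items()}

# Classifiers ordered from highest to lowest mean F-measure.
ranking = sorted(mean_f, key=mean_f.get, reverse=True)

for name in ranking:
    print(f"{name:<22} {mean_f[name]:6.2f}")
```

Under this aggregation the Multilayer Perceptron comes first and Naïve Bayes last, while the gaps between the remaining classifiers stay within a few percentage points, in line with the remark that no algorithm clearly dominates.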
  • 59. Results/4: Answers
        Q1: Which of the paradigms achieves the best F-measures?
        A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.
        Q2: Which of the paradigms is most robust against noise?
        A2: Logistic Regression, Random Trees, Multilayer Perceptrons.
        Q3: Which of the methods is the most time-efficient?
        A3: Random Trees; however, all approaches scale well.
  • 67. Related Work
        Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)
        LIMES: Link Discovery Framework for Metric Spaces
        Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)
        Dedicated efficient methods (RDF-AI, REEDED)
        LinkLion: A Link Repository for the Web of Data
        The SAIM interface
        Other link discovery frameworks (SILK, LDIF)
        Other machine learning frameworks (MARLIN, FEBRL, RAVEN)
        Other blocking techniques (MultiBlock, KnoFuss)
  • 70. Future Work
        1. Integration of Multilayer Perceptrons into the LIMES framework.
        2. Use of ensemble learning techniques.
        3. Evaluation in a semi-supervised learning setting with little training data.
        4. Evaluation using a larger number of similarity measures.
        5. Incorporation of a component based on Statistical Relational Learning.
  • 72. Web resources
        Source code, Batch Learners Evaluation for Link Discovery: http://github.com/mommi84/BALLAD
        Technical report, Batch Learners Evaluation for Link Discovery: http://mommi84.github.io/BALLAD
        The OAEI 2010 Benchmark: http://oaei.ontologymatching.org/2010/benchmarks
        The Benchmark for Entity Resolution, DBS Leipzig: http://goo.gl/bvWBjA
        Weka, Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka
        LibSVM, A Library for Support Vector Machines: http://www.csie.ntu.edu.tw/~cjlin/libsvm
        LIMES, Link Discovery Framework for Metric Spaces: http://aksw.org/Projects/LIMES
        LinkLion, A Link Repository for the Web of Data: http://www.linklion.org
  • 74. Thank you for your attention.