Learning Concept Mappings from Instance Similarity

  1. Learning Concept Mappings from Instance Similarity. Shenghui Wang (1), Gwenn Englebienne (2), Stefan Schlobach (1). (1) Vrije Universiteit Amsterdam; (2) Universiteit van Amsterdam. ISWC 2008.
  2. Outline: 1. Introduction (thesaurus mapping; instance-based techniques). 2. Mapping method: classification based on instance similarity (representing concepts and the similarity between them; classification based on instance similarity). 3. Experiments and results (experiment setup; results). 4. Summary.
  3. Thesaurus mapping. SemanTic Interoperability To access Cultural Heritage (STITCH) through mappings between thesauri, e.g. "plankzeilen" vs. "surfsport" (Dutch: "windsurfing" vs. "surfing"), or "griep" vs. "influenza" ("flu" vs. "influenza"). Scope of the problem: big thesauri with tens of thousands of concepts; huge collections (e.g., KB: 80 km of books in one collection); heterogeneous collections (e.g., books, manuscripts, illustrations); a multi-lingual problem.
  4. Instance-based techniques: common-instance based.
  7. Common-instance-based techniques: pros and cons. Advantages: simple to implement; interesting results. Disadvantages: requires a sufficient number of common instances; uses only part of the available information.
  8. Instance-based techniques: instance-similarity based.
  11. Representing concepts and the similarity between them. [Diagram: each concept is represented by its instances; each instance's metadata fields (Creator, Title, Publisher, ...) are turned into bags of words with term counts; the cosine distance between corresponding field vectors of the two concepts yields the pair features f1, f2, f3, ... Instance features are aggregated into concept features, which are compared to produce pair features.]
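The representation sketched in the diagram can be written down in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the whitespace tokenisation, and the use of cosine similarity over raw term counts are illustrative assumptions.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Turn a metadata field value into a term-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0 if either is empty)."""
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pair_features(instances1, instances2, fields):
    """One similarity feature per field pair: aggregate each field's text
    over a concept's instances into a bag of words, then compare."""
    feats = []
    for f1 in fields:
        for f2 in fields:
            bow1 = bag_of_words(" ".join(i.get(f1, "") for i in instances1))
            bow2 = bag_of_words(" ".join(i.get(f2, "") for i in instances2))
            feats.append(cosine_similarity(bow1, bow2))
    return feats
```

With two fields this yields four pair features (title-title, title-creator, creator-title, creator-creator), matching the idea that every field combination becomes one coordinate of the pair's similarity vector.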
  12. Classification based on instance similarity. Each pair of concepts is treated as a point in a "similarity space": its position is given by the pair's features, which are the different measures of similarity between the concepts' instances. Hypothesis: the label of a point (whether the pair is a positive or a negative mapping) is correlated with the point's position in this space. Given already-labelled points, a new pair can then be classified, i.e., assigned the right label, based on its location as determined by its actual similarity values.
  13. The classifier used: Markov Random Field. Let $T = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ be the training set, where $x^{(i)} \in \mathbb{R}^K$ are the features and $y^{(i)} \in Y = \{\text{positive}, \text{negative}\}$ is the label. The conditional probability of a label given the input is modelled as
      $$p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{Z(x^{(i)}, \theta)} \exp\left( \sum_{j=1}^{K} \lambda_j \, \phi_j(y^{(i)}, x^{(i)}) \right), \qquad (1)$$
      where $\theta = \{\lambda_j\}_{j=1}^{K}$ are the weights associated with the feature functions $\phi_j$ and $Z(x^{(i)}, \theta)$ is a normalisation constant.
  14. The classifier used: Markov Random Field (cont'd). The likelihood of the data set for given model parameters $\theta$ is
      $$p(T \mid \theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, \theta). \qquad (2)$$
      During learning, the objective is to find the most likely values of $\theta$ for the given training data. The decision criterion for assigning a label $y^{(i)}$ to a new pair of concepts $i$ is then simply
      $$y^{(i)} = \operatorname*{argmax}_{y} \; p(y \mid x^{(i)}). \qquad (3)$$
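For a binary label set, a conditional log-linear model of the form (1) with feature functions $\phi_j(y, x) = x_j$ when $y$ is positive (and 0 otherwise) reduces to logistic regression, so training by maximising (2) can be sketched as plain gradient ascent. This is an illustrative simplification under that assumed parameterisation, not the authors' training code; the function names are hypothetical.

```python
import numpy as np

def train_loglinear(X, y, lr=0.1, epochs=500):
    """Maximise the conditional log-likelihood (2) by gradient ascent,
    assuming phi_j(y, x) = x_j for the positive label and 0 otherwise,
    under which the model is equivalent to logistic regression."""
    lam = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ lam))   # p(positive | x, lambda)
        lam += lr * X.T @ (y - p) / len(y)   # gradient of the log-likelihood
    return lam

def predict(lam, X):
    """Decision rule (3): pick the label with the higher probability."""
    return (X @ lam > 0).astype(int)
```

A pair whose similarity features push the weighted sum above zero is labelled positive; the learned weights $\lambda_j$ are the same quantities the talk later inspects qualitatively.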
  15. Experiment setup. Two cases: (1) mapping GTT and Brinkman, used in the Koninklijke Bibliotheek (KB): homogeneous collections; (2) mapping GTT/Brinkman and GTAA, used in Beeld en Geluid (BG): heterogeneous collections. Evaluation: misclassification (error) rate; 10-fold cross-validation; testing on special data sets.
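The evaluation protocol (error rate under 10-fold cross-validation, reported as mean ± standard deviation, as in the tables that follow) can be sketched generically. The helper below is an assumed illustration, not the authors' harness; `train` and `predict` stand in for any classifier.

```python
import statistics

def cross_validated_error(xs, ys, train, predict, k=10):
    """k-fold cross-validation: return mean and stdev of the
    misclassification rate across the k held-out folds."""
    folds = [(xs[i::k], ys[i::k]) for i in range(k)]
    errors = []
    for i in range(k):
        test_x, test_y = folds[i]
        train_x = [x for j in range(k) if j != i for x in folds[j][0]]
        train_y = [y for j in range(k) if j != i for y in folds[j][1]]
        model = train(train_x, train_y)
        wrong = sum(predict(model, x) != y for x, y in zip(test_x, test_y))
        errors.append(wrong / len(test_y))
    return statistics.mean(errors), statistics.stdev(errors)
```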
  16. Experiment I: Feature-similarity-based mapping versus existing methods. Are the benefits from feature similarity of instances in extensional mapping significant when compared to existing methods?

      Mapping method                      Error rate
      Falcon                              0.28895
      S_lex                               0.42620 ± 0.049685
      S_jacc80                            0.44643 ± 0.059524
      S_bag                               0.57380 ± 0.049685
      {f1, ..., f28} (our new approach)   0.20491 ± 0.026158

      Table: Comparison between existing methods and similarity-based mapping, in the KB case.
  18. Experiment II: Extending to corpora without joint instances. Can our approach be applied to corpora for which there are no doubly annotated instances, i.e., for which there are no joint instances?

      Collections                               Testing set       Error rate
      Joint instances (original KB corpus)      golden standard   0.20491 ± 0.026158
      Joint instances (original KB corpus)      lexical only      0.137871
      No joint instances (doubles removed)      golden standard   0.28378 ± 0.026265
      No joint instances (doubles removed)      lexical only      0.161867

      Table: Comparison between classifiers using joint and disjoint instances, in the KB case.
  20. Experiment III: Extending to heterogeneous collections. Can our approach be applied to corpora in which instances are described in a heterogeneous way? Feature selection options: exhaustive combination, calculating the similarity between all possible pairs of fields (requires more training data to avoid over-fitting); manual selection of corresponding metadata field pairs; mutual information to select the most informative field pairs.
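The mutual-information option above can be sketched with a standard plug-in estimate over paired discrete observations (e.g., field values co-occurring on the same record); field pairs are then ranked by the score. The estimator below is the textbook formula, not the authors' specific procedure, and the function name is illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x)p(y))),
    in nats, from paired samples of two discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi
```

Perfectly correlated fields score log 2 ≈ 0.69 nats on a two-value toy example, while independent fields score 0, which is exactly the ordering needed to pick informative field pairs.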
  25. Feature selection. Can we maintain high mapping quality when features are selected (semi-)automatically?

      Thesaurus           Feature selection    Error rate
      GTAA vs. Brinkman   manual selection     0.11290 ± 0.025217
      GTAA vs. Brinkman   mutual information   0.09355 ± 0.044204
      GTAA vs. Brinkman   exhaustive           0.10323 ± 0.031533
      GTAA vs. GTT        manual selection     0.10000 ± 0.050413
      GTAA vs. GTT        mutual information   0.07826 ± 0.044904
      GTAA vs. GTT        exhaustive           0.11304 ± 0.046738

      Table: Comparison of performance with different methods of feature selection, using the non-lexical dataset.
  27. Training set: manually built golden standard (751 mappings); lexical seeding; background seeding.

      Thesauri            lexical   non-lexical
      GTAA vs. GTT        2720      116
      GTAA vs. Brinkman   1372      323

      Table: Numbers of positive examples in the training sets.
  29. Bias in the training sets.

      Thesauri            Training set   Test set      Error rate
      GTAA vs. Brinkman   non-lexical    non-lexical   0.09355 ± 0.044204
      GTAA vs. Brinkman   lexical        non-lexical   0.11501
      GTAA vs. Brinkman   non-lexical    lexical       0.07124
      GTAA vs. Brinkman   lexical        lexical       0.04871 ± 0.029911
      GTAA vs. GTT        non-lexical    non-lexical   0.07826 ± 0.044904
      GTAA vs. GTT        lexical        non-lexical   0.098712
      GTAA vs. GTT        non-lexical    lexical       0.088603
      GTAA vs. GTT        lexical        lexical       0.06195 ± 0.008038

      Table: Comparison using different datasets (features selected using mutual information).
  31. Positive-negative ratios in the training sets. [Figure: error rate, F-measure, precision and recall for the positive class, plotted against the ratio between positive and negative examples in the training set (0 to 1000); panel (a) tests on 1:1 data, panel (b) on 1:1000 data. The influence of positive-negative ratios, Brinkman vs. GTAA.]
  32. Positive-negative ratios in the training sets (cont'd). In practice, the training data should be chosen to contain a representative ratio of positive and negative examples, while still providing enough material for the classifier to achieve good predictive capacity on both types of examples.
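Building a training set at a chosen negative:positive ratio amounts to subsampling the (typically much larger) pool of negative concept pairs. A minimal sketch of that step, with assumed function and parameter names:

```python
import random

def sample_training_set(pos, neg, neg_per_pos, seed=0):
    """Build a labelled training set with roughly neg_per_pos negative
    examples per positive one, by subsampling the negative pool."""
    rng = random.Random(seed)
    n_neg = min(len(neg), len(pos) * neg_per_pos)
    data = [(x, 1) for x in pos] + [(x, 0) for x in rng.sample(neg, n_neg)]
    rng.shuffle(data)
    return data
```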
  33. Experiment IV: Qualitative use of feature weights. The learned weight λj reflects the importance of feature fj in determining similarity (mappings) between concepts. Field pairs:

      KB field             BG field
      kb:title             bg:subject
      kb:abstract          bg:subject
      kb:annotation        bg:LOCATIES
      kb:annotation        bg:SUBSIDIE
      kb:creator           bg:contributor
      kb:creator           bg:PERSOONSNAMEN
      kb:Date              bg:OPNAMEDATUM
      kb:dateCopyrighted   bg:date
      kb:description       bg:subject
      kb:publisher         bg:NAMEN
      kb:temporal          bg:date
  35. Summary. We use a machine-learning method that exploits the similarity between instances to automatically determine mappings between concepts from different thesauri/ontologies. It enables mappings between thesauri used for very heterogeneous collections; it does not require dually annotated instances; it is not limited by the language barrier; and it contributes to the field of metadata mapping. Future work: more heterogeneous collections; smarter measures of similarity between instance metadata; more similarity dimensions between concepts, e.g., lexical and structural.
  36. Thank you.
