SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

Transcript

  • 1. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data. Samur Araujo, DucThanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe. Delft University of Technology. WebDB 2012.
  • 2. Me; You
  • 3. Apple; Me
  • 4. You
  • 5. ? You; Ambiguous
  • 6. Me
  • 7. Me
  • 8. Me
  • 9. You
  • 10. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable.
  • 11. My Apple: Shape, Color, Eatable. Your Apple: Shape, Color, Eatable.
  • 12. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable. Fruit.
  • 13. My Apple; Your Apple
  • 14. Instance Matching: Source, Target
  • 15. “Instance matching uses a direct comparison paradigm”.
  • 17. Is your Apple like my Apple? Hmm... Maybe! (Source, Target)
  • 18. Homogeneous data and schema.
  • 19. The source and target descriptions overlap. (Source, Target)
  • 20. Syntactic Overlap: Population = TotalPopulation
  • 21. Semantic Overlap: Population = Num_Inhabitants
  • 22. Web of Data: heterogeneous data and schema
  • 23. No or limited overlap between schemas. (Source, Target)
  • 24. Instances do not instantiate the schema properly. (Source, Target)
  • 25. Apple: Nutritional Information; Botanical Information
  • 26. “Direct comparison paradigm does not apply”. (Source, Target)
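The direct comparison idea from slides 15 to 26 can be illustrated with a minimal sketch. It assumes predicates are aligned purely by the string similarity of their names; the function names, the 0.7 threshold, and the extra `Area` predicate are hypothetical and not taken from the slides. Under that assumption the syntactic pair (Population, TotalPopulation) is found, while the semantic pair (Population, Num_Inhabitants) is missed.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """String similarity between two predicate names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_predicates(source_preds, target_preds, threshold=0.7):
    """Pair each source predicate with its most similar target predicate.

    A pair is kept only if the name similarity reaches the threshold
    (0.7 is an arbitrary illustrative value).
    """
    pairs = {}
    for s in source_preds:
        best = max(target_preds, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            pairs[s] = best
    return pairs


# Syntactic overlap: the names share a long common substring.
syntactic = align_predicates(["Population"], ["TotalPopulation", "Area"])

# Semantic overlap: same meaning, unrelated names, so nothing is aligned.
semantic = align_predicates(["Population"], ["Num_Inhabitants", "Area"])
```

A name-based alignment like this only works when the two schemas overlap, which is the limitation the slides point out for heterogeneous Web data.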
  • 27. Apple; Me
  • 28. Apple, Orange, Pineapple; Me
  • 29. Apple, Orange, Pineapple; Me; You
  • 30. You
  • 31. Food
  • 33. Eatable; Food
  • 34. Source
  • 35. My Apple, Your Apple; My Orange, Your Orange; My Pineapple, Your Pineapple
  • 37. “We use a class-based disambiguation paradigm …”
  • 38. “We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.”
  • 39. Instance Matching with SERIMI: Source, Target
  • 40. Instance Representation
  • 41. Instance Representation: Predicate, Instance, Value
  • 42. Instance Representation: shape, Apple1, Round; title, Apple1, Apple; color, Apple1, Red; category, Apple1, Eatable
  • 43. Instance Representation: shape, Apple1, Round; title, Apple1, Apple; color, Apple1, Red; category, Apple1, Eatable
  • 44. Instance Representation: shape, Apple1, Round; title, Apple1, Apple; color, Apple1, Red; category, Apple1, Eatable
  • 45. Instance Representation: [P(hi), D(hi), O(hi), T(hi)]
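The (predicate, instance, value) rows shown on the Instance Representation slides can be held as plain tuples and grouped per instance. The sketch below uses the Apple1 rows from the slides; the `describe` helper is a hypothetical name, and storing a set of values per predicate is an assumption made so that multi-valued predicates are not lost. The slides do not define P, D, O, and T, so the sketch does not attempt to compute them.

```python
from collections import defaultdict

# Rows in the column order used on the slide: predicate, instance, value.
triples = [
    ("shape", "Apple1", "Round"),
    ("title", "Apple1", "Apple"),
    ("color", "Apple1", "Red"),
    ("category", "Apple1", "Eatable"),
]


def describe(rows):
    """Group rows by instance: instance -> predicate -> set of values."""
    by_instance = defaultdict(lambda: defaultdict(set))
    for predicate, instance, value in rows:
        by_instance[instance][predicate].add(value)
    # Convert to plain dicts so missing keys raise instead of being created.
    return {inst: dict(preds) for inst, preds in by_instance.items()}


descriptions = describe(triples)
```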
  • 46. Step 1: Cluster the source. Source: Cars, Fruits, Companies
  • 47. Step 2: Blocking Key Selection. Source instances, Key Selection
  • 48. Step 2: Blocking Key Selection. Source instances, Key Selection: key, key, key
  • 49. Step 2: Blocking Key Selection. Source instances, Key Selection: key, key, key (e.g. Title)
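The slides name the blocking key (e.g. Title) but do not say how it is selected. As a rough illustration only, the sketch below assumes a simple heuristic that is not claimed to be SERIMI's: prefer the predicate that is present on most source instances and whose values are most distinct. The function name and the sample data are hypothetical.

```python
def select_blocking_key(instances):
    """Pick one predicate to use as blocking key.

    instances: dict mapping instance id -> dict of predicate -> value.
    Score = coverage (share of instances having the predicate)
            * distinctness (share of distinct values among its values).
    This scoring is an assumed heuristic for illustration.
    """
    total = len(instances)
    values_by_pred = {}
    for description in instances.values():
        for predicate, value in description.items():
            values_by_pred.setdefault(predicate, []).append(value)

    def score(predicate):
        values = values_by_pred[predicate]
        coverage = len(values) / total
        distinctness = len(set(values)) / len(values)
        return coverage * distinctness

    return max(values_by_pred, key=score)


# Made-up source instances, all from one cluster (fruits).
source = {
    "s1": {"title": "Apple", "category": "Eatable", "color": "Red"},
    "s2": {"title": "Orange", "category": "Eatable", "color": "Orange"},
    "s3": {"title": "Pineapple", "category": "Eatable"},
}

key = select_blocking_key(source)
```

Here `title` wins because every instance has it and all its values differ, whereas `category` has a single shared value and `color` is missing on one instance.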
  • 50. Step 3: Pseudo-Homonyms Builder. Title=apple, Title=orange, Title=pineapple; Pseudo-Homonyms Builder; Target
  • 51. Step 3: Pseudo-Homonyms Builder. Source instances; Pseudo-Homonyms Builder; Target; Target pseudo-homonym sets (everything called Apple)
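Step 3 collects, for each source instance, everything in the target that shares its key value ("everything called Apple"). A minimal sketch under stated assumptions: the target is an in-memory dict, matching is a case-insensitive whole-word match against any target value, and all identifiers and data are made up for illustration.

```python
def build_pseudo_homonyms(source, target, key="title"):
    """Map each source instance to the target instances sharing its key value.

    A target instance is a candidate when the source key value appears as a
    whole word in any of its values (an assumed, simplistic lookup).
    """
    homonym_sets = {}
    for source_id, description in source.items():
        label = description[key].lower()
        homonym_sets[source_id] = [
            target_id
            for target_id, target_description in target.items()
            if any(label in str(value).lower().split()
                   for value in target_description.values())
        ]
    return homonym_sets


source = {
    "s1": {"title": "Apple"},
    "s2": {"title": "Orange"},
    "s3": {"title": "Pineapple"},
}

# Hypothetical target data: the schema differs from the source ("name", not "title").
target = {
    "t1": {"name": "Apple", "kind": "Fruit"},
    "t2": {"name": "Apple Inc.", "kind": "Company"},
    "t3": {"name": "Orange", "kind": "Fruit"},
    "t4": {"name": "Orange S.A.", "kind": "Company"},
    "t5": {"name": "Pineapple", "kind": "Fruit"},
}

pseudo_homonyms = build_pseudo_homonyms(source, target)
```

Each resulting set mixes instances of different classes that happen to share a label, which is the ambiguity Step 4 has to resolve.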
  • 52. Step 4: Class-based disambiguation. Target pseudo-homonym sets; Class-based Disambiguator
  • 53. Step 4: Class-based disambiguation. Source instances; Target pseudo-homonym sets; Class-based Disambiguator
  • 54. Step 4: Class-based disambiguation
  • 55. Step 4: Class-based disambiguation
  • 56. Step 4: Class-based disambiguation. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 57. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 58. Instance Representation: shape, Apple1, Round; title, Apple1, Apple; color, Apple1, Red; category, Apple1, Eatable
  • 59. Step 4: Class-based disambiguation. Scores per instance: H1: h11 = 0.98, h12 = 0.32, h13 = 0.32, h14 = 0.76; H2: h21 = 0.95, h22 = 0.53; H3: h31 = 0.94, h32 = 0.91, h33 = 0.87
  • 60. Step 4: Class-based disambiguation. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 61. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 62. Step 4: Class-based disambiguation
  • 63. Step 4: Class-based disambiguation. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}; h11 is replaced by its score, 0.98
  • 64. Step 4: Class-based disambiguation. Scores: H1: 0.98, 0.32, 0.32, 0.76; H2: 0.95, 0.53; H3: 0.94, 0.91, 0.87
  • 65. Step 4: Class-based disambiguation
  • 66. Step 4: Class-based disambiguation. TOP-K or Threshold: H1: 0.98, H2: 0.95, H3: 0.94
  • 67. Step 4: Class-based disambiguation
  • 68. Step 4: Class-based disambiguation. Source instances; Target pseudo-homonym sets; Class-based Disambiguator
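The scoring and selection shown in Step 4 can be sketched as follows. The slides do not give the similarity function, so this sketch substitutes a plain Jaccard similarity over each candidate's set of predicates: a candidate scores high when every other pseudo-homonym set contains something that looks like it, which favours candidates of the class shared across sets. The function names, the feature choice, and the data are all assumptions for illustration and should not be read as SERIMI's actual measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_candidates(homonym_sets, features):
    """Score each candidate by its mean best-match similarity to the other sets."""
    scores = {}
    for i, current in enumerate(homonym_sets):
        others = [s for j, s in enumerate(homonym_sets) if j != i and s]
        for h in current:
            best = [max(jaccard(features[h], features[o]) for o in other)
                    for other in others]
            scores[h] = sum(best) / len(best) if best else 0.0
    return scores


def select(homonym_sets, scores, top_k=1, threshold=None):
    """Keep the top-k candidates per set, or all candidates above a threshold."""
    chosen = []
    for candidates in homonym_sets:
        ranked = sorted(candidates, key=lambda h: scores[h], reverse=True)
        if threshold is not None:
            chosen.append([h for h in ranked if scores[h] >= threshold])
        else:
            chosen.append(ranked[:top_k])
    return chosen


# Hypothetical candidates described by the predicates they use.
fruit = {"name", "color", "taste"}
company = {"name", "founder", "revenue"}
features = {
    "apple_fruit": fruit, "apple_inc": company,
    "orange_fruit": fruit, "orange_sa": company,
    "pineapple_fruit": fruit,
}
H = [["apple_fruit", "apple_inc"], ["orange_fruit", "orange_sa"], ["pineapple_fruit"]]

scores = score_candidates(H, features)
top1 = select(H, scores, top_k=1)
above_half = select(H, scores, threshold=0.5)
```

In this toy data the company candidates score lower because the pineapple set contains no company, so Top-1 keeps only the fruits, while a permissive threshold of 0.5 lets the companies through as well.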
  • 69. Experiment
      • Ontology Alignment Evaluation Initiative (OAEI 2010)
      • Collections: the life science (LS) collection (DBPedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection
      • 20 gigabytes of data, millions of triples.
      • We compared SERIMI to ObjectCoref and RiMOM
      • Precision, Recall and F1
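For reference, the three evaluation measures named on the Experiment slide can be computed from a set of predicted matches and a set of reference matches as below. The pairs are made-up placeholders, not OAEI data, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, reference):
    """Standard precision, recall, and F1 over sets of (source, target) pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative pairs only.
predicted = {("s1", "t1"), ("s2", "t3"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t3"), ("s4", "t6"), ("s5", "t7")}

precision, recall, f1 = precision_recall_f1(predicted, reference)
```

With two of three predictions correct and two of four reference pairs found, precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.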
  • 70. Results
  • 71. Results
  • 72. Results
  • 74. Results
  • 75. Step 4: Class-based disambiguation. TOP-K or Threshold: H1: 0.98, H2: 0.95, H3: 0.94
  • 76. Results for Top-K. [Chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12); series: Top-1, Top-2, Top-5, Top-10]
  • 77. Results for δ threshold. [Chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12); series: δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, δ >= 0.85]
  • 78. Conclusion
      • SERIMI is a complementary approach to direct-match based instance matching tools.
      • SERIMI is recommended for heterogeneous data where there is no overlap between schemas.
      • It is recommended for multi-class disambiguation.
  • 79. THANK YOU! Samur Araujo, s.f.cardosodearaujo@tudelft.nl