SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data
Transcript

  • 1. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data. Samur Araujo, DucThanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe. Delft University of Technology. WebDB 2012.
  • 2. Me, You
  • 3. Apple, Me
  • 4. You
  • 5. ? You, Ambiguous
  • 6. Me
  • 7. Me
  • 8. Me
  • 9. You
  • 10. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable.
  • 11. My Apple: Shape, Color, Eatable. Your Apple: Shape, Color, Eatable.
  • 12. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable. Fruit.
  • 13. My Apple, Your Apple
  • 14. Instance Matching: Source, Target
  • 15. “Instance matching uses a direct comparison paradigm”.
  • 17. Source: Is your Apple like my Apple? Target: Hmm... Maybe!
  • 18. Homogeneous data and schema.
  • 19. The source and target descriptions overlap. Source, Target
  • 20. Syntactic Overlap: Population = TotalPopulation
  • 21. Semantic Overlap: Population = Num_Inhabitants
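The two overlap slides can be made concrete with a small sketch. It is not taken from the talk: the tokenizer and the Jaccard measure are illustrative assumptions, and `name_tokens` and `syntactic_overlap` are hypothetical helper names. It shows why a purely syntactic comparison of predicate names finds the first pair but misses the second.

```python
import re

def name_tokens(predicate):
    """Split a predicate name on camelCase and underscores into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", predicate)
    return {part.lower() for part in parts}

def syntactic_overlap(a, b):
    """Jaccard overlap of the name tokens (an assumed, illustrative measure)."""
    ta, tb = name_tokens(a), name_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Pairs taken from the slides.
syntactic = syntactic_overlap("Population", "TotalPopulation")  # share the token "population"
semantic = syntactic_overlap("Population", "Num_Inhabitants")   # share no token
```

The second pair scores zero even though both predicates mean the same thing, which is the case the slides call semantic overlap.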
  • 22. Web of Data: heterogeneous data and schema
  • 23. No or limited overlap between schemas. Source, Target
  • 24. Instances do not instantiate the schema properly. Source, Target
  • 25. Apple: Nutritional Information, Botanical Information
  • 26. “Direct comparison paradigm does not apply”. Source, Target
  • 27. Apple, Me
  • 28. Me. Apple, Orange, Pineapple
  • 29. Me, You. Apple, Orange, Pineapple
  • 30. You
  • 31. Food
  • 33. Eatable, Food
  • 34. Source
  • 35. My Apple, Your Apple. My Orange, Your Orange. My Pineapple, Your Pineapple.
  • 37. “We use a class-based disambiguation paradigm …”
  • 38. “We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.”
  • 39. Instance Matching with SERIMI: Source, Target
  • 40. Instance Representation
  • 41. Instance Representation: Predicate, Instance, Value
  • 42. Instance Representation: (shape, Apple1, Round), (title, Apple1, Apple), (color, Apple1, Red), (category, Apple1, Eatable)
  • 43. Instance Representation: (shape, Apple1, Round), (title, Apple1, Apple), (color, Apple1, Red), (category, Apple1, Eatable)
  • 44. Instance Representation: (shape, Apple1, Round), (title, Apple1, Apple), (color, Apple1, Red), (category, Apple1, Eatable)
  • 45. Instance Representation: [P(hi), D(hi), O(hi), T(hi)]
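A minimal sketch of how the Apple1 triples could be turned into the four-part representation. The slides do not define P, D, O and T, so the reading used here is an assumption: P as the set of predicates, D as predicate/literal pairs, O as predicate/resource pairs, and T as the tokens of the literal values. `Features` and `represent` are hypothetical names, and treating "Eatable" as a resource rather than a literal is also an assumption.

```python
from typing import NamedTuple

class Features(NamedTuple):
    P: frozenset  # predicates used by the instance (assumed meaning)
    D: frozenset  # (predicate, literal) pairs (assumed meaning)
    O: frozenset  # (predicate, resource) pairs (assumed meaning)
    T: frozenset  # lowercase tokens of literal values (assumed meaning)

def represent(instance, triples, resources=frozenset()):
    """Build the four feature sets from (predicate, instance, value) triples."""
    own = [(p, v) for p, s, v in triples if s == instance]
    literals = [(p, v) for p, v in own if v not in resources]
    objects = [(p, v) for p, v in own if v in resources]
    tokens = {tok for _, v in literals for tok in str(v).lower().split()}
    return Features(
        P=frozenset(p for p, _ in own),
        D=frozenset(literals),
        O=frozenset(objects),
        T=frozenset(tokens),
    )

# Triples from the slides, in (predicate, instance, value) order.
triples = [
    ("shape", "Apple1", "Round"),
    ("title", "Apple1", "Apple"),
    ("color", "Apple1", "Red"),
    ("category", "Apple1", "Eatable"),
]
apple1 = represent("Apple1", triples, resources={"Eatable"})
```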
  • 46. Step 1: Cluster the source. Source: Cars, Fruits, Companies
  • 47. Step 2: Blocking Key Selection. Source instances, Key Selection
  • 48. Step 2: Blocking Key Selection. Source instances, Key Selection, key, key, key
  • 49. Step 2: Blocking Key Selection. Source instances, Key Selection, key (e.g. Title), key, key
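The slides only say that a key such as Title is selected for the source instances. One way this could work, sketched under assumptions: pick the predicate that most instances have (coverage) and whose values are mostly distinct (discriminability). The scoring rule, the function name `select_key`, and the toy triples are all hypothetical and not the selection criterion from the talk.

```python
from collections import defaultdict

def select_key(triples):
    """Pick the predicate with the highest coverage * discriminability.

    triples are (predicate, instance, value); the scoring rule is an assumption.
    """
    values = defaultdict(list)   # predicate -> all values seen
    subjects = defaultdict(set)  # predicate -> instances that use it
    instances = set()
    for predicate, instance, value in triples:
        values[predicate].append(value)
        subjects[predicate].add(instance)
        instances.add(instance)

    def score(predicate):
        coverage = len(subjects[predicate]) / len(instances)
        discriminability = len(set(values[predicate])) / len(values[predicate])
        return coverage * discriminability

    # sorted() makes ties resolve alphabetically, so the result is deterministic.
    return max(sorted(values), key=score)

# Hypothetical source instances.
source = [
    ("title", "Apple1", "Apple"),
    ("title", "Orange1", "Orange"),
    ("title", "Pineapple1", "Pineapple"),
    ("category", "Apple1", "Eatable"),
    ("category", "Orange1", "Eatable"),
    ("category", "Pineapple1", "Eatable"),
    ("color", "Apple1", "Red"),
]
key = select_key(source)
```

Here `title` wins because every instance has one and all values differ, while `category` is shared by all instances and `color` is present on only one.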
  • 50. Step 3: Pseudo-Homonyms Builder. Title=apple, Title=orange, Title=pineapple. Pseudo-Homonyms Builder, Target
  • 51. Step 3: Pseudo-Homonyms Builder. Source instances, Pseudo-Homonyms Builder, Target, Target pseudo-homonym sets. Everything called Apple.
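A sketch of what the builder could do with those keys: for each source instance, collect every target instance whose label contains the key value ("everything called Apple"). Token containment as the matching rule, the function names, and the target labels are assumptions made for illustration.

```python
import re

def label_tokens(label):
    """Lowercase alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))

def build_pseudo_homonyms(source_keys, target_labels):
    """Map each source instance to the target instances whose label contains its key.

    source_keys: {source_instance: key value}; target_labels: {target_instance: label}.
    """
    sets = {}
    for source, key in source_keys.items():
        wanted = label_tokens(key)
        sets[source] = sorted(
            target for target, label in target_labels.items()
            if wanted <= label_tokens(label)
        )
    return sets

# Keys from the slides; the target labels are hypothetical.
source_keys = {"s_apple": "apple", "s_orange": "orange", "s_pineapple": "pineapple"}
target_labels = {
    "t1": "Apple",
    "t2": "Apple Inc.",
    "t3": "Orange",
    "t4": "Pineapple",
    "t5": "Orange County",
}
pseudo_homonyms = build_pseudo_homonyms(source_keys, target_labels)
```

Each resulting set mixes the intended match with unrelated instances that merely share the name, which is what Step 4 has to resolve.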
  • 52. Step 4: Class-based disambiguation. Target pseudo-homonym sets, Class-based Disambiguator, Disambiguation
  • 53. Step 4: Class-based disambiguation. Source instances, Target pseudo-homonym sets, Class-based Disambiguator, Disambiguation
  • 54. Step 4: Class-based disambiguation
  • 55. Step 4: Class-based disambiguation
  • 56. Step 4: Class-based disambiguation. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 57. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 58. Instance Representation: (shape, Apple1, Round), (title, Apple1, Apple), (color, Apple1, Red), (category, Apple1, Eatable)
  • 59. Step 4: Class-based disambiguation. Pseudo-homonym sets with instance scores: H1 = {h11: 0.98, h12: 0.32, h13: 0.32, h14: 0.76}, H2 = {h21: 0.95, h22: 0.53}, H3 = {h31: 0.94, h32: 0.91, h33: 0.87}
  • 60. Step 4: Class-based disambiguation. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 61. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}
  • 62. Step 4: Class-based disambiguation
  • 63. Step 4: Class-based disambiguation. Left: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}. Right: the same sets, with h11 replaced by its score 0.98
  • 64. Step 4: Class-based disambiguation. Scores per set: H1: 0.98, 0.32, 0.32, 0.76. H2: 0.95, 0.53. H3: 0.94, 0.91, 0.87
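The slides show each candidate receiving a score but do not state the scoring function. One plausible reading, sketched here under assumptions: a candidate scores high when it resembles some candidate in every other pseudo-homonym set, because the correct matches should all belong to the same class. Plain Jaccard similarity over feature sets is swapped in as a stand-in for whatever measure SERIMI actually uses, and the feature sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (stand-in similarity, not SERIMI's own)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def class_based_scores(homonym_sets):
    """Score each candidate by its best similarity to every other set, averaged.

    homonym_sets: {set name: {candidate: feature set}}.
    """
    scores = {}
    for name, candidates in homonym_sets.items():
        others = [c for other, c in homonym_sets.items() if other != name and c]
        scores[name] = {}
        for candidate, features in candidates.items():
            best = [max(jaccard(features, f) for f in other.values()) for other in others]
            scores[name][candidate] = sum(best) / len(best) if best else 0.0
    return scores

# Hypothetical candidates described by the predicates they use.
homonym_sets = {
    "H1": {"apple_fruit": {"shape", "color", "category"},
           "apple_company": {"founder", "revenue"}},
    "H2": {"orange_fruit": {"shape", "color", "category"},
           "orange_telecom": {"revenue", "ceo"}},
    "H3": {"pineapple_fruit": {"shape", "color", "category"}},
}
scores = class_based_scores(homonym_sets)
```

The fruit candidates reinforce each other across all three sets, while the company candidates find little support outside their own set and score lower.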
  • 65. Step 4: Class-based disambiguation
  • 66. Step 4: Class-based disambiguation. Top-K or threshold. H1: 0.98, H2: 0.95, H3: 0.94
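The final selection, sketched with the scores shown on the slides. The pairing of scores to instances follows their position in the slide layout, which is an assumption, and `select_top_k` and `select_by_threshold` are hypothetical names.

```python
def select_top_k(scored, k=1):
    """Keep the k highest-scoring candidates of one pseudo-homonym set."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def select_by_threshold(scored, delta):
    """Keep every candidate whose score is at least delta."""
    return [name for name, score in scored.items() if score >= delta]

# Scores from the slides, paired with instances by position (assumed).
scores = {
    "H1": {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76},
    "H2": {"h21": 0.95, "h22": 0.53},
    "H3": {"h31": 0.94, "h32": 0.91, "h33": 0.87},
}
top1 = {name: select_top_k(scored, k=1) for name, scored in scores.items()}
kept = {name: select_by_threshold(scored, delta=0.90) for name, scored in scores.items()}
```

Top-1 always returns exactly one match per source instance, whereas a threshold can return several (here both h31 and h32 for H3) or none.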
  • 67. Step 4: Class-based disambiguation
  • 68. Step 4: Class-based disambiguation. Source instances, Target pseudo-homonym sets, Class-based Disambiguator, Disambiguation
  • 69. Experiment
      • Ontology Alignment Evaluation Initiative (OAEI 2010)
      • Collections: the life science (LS) collection (DBPedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection
      • 20 gigabytes of data, millions of triples
      • We compared SERIMI to ObjectCoref and RiMOM
      • Metrics: Precision, Recall and F1
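For reference, the three metrics can be computed from a set of predicted matches and a reference alignment as below. This is the standard definition, not code from the evaluation; the match pairs are made up for illustration.

```python
def precision_recall_f1(predicted, reference):
    """Standard precision, recall and F1 over sets of (source, target) matches."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical matches: two of three predictions are correct, two of four references are found.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
precision, recall, f1 = precision_recall_f1(predicted, reference)
```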
  • 70. Results
  • 71. Results
  • 72. Results
  • 74. Results
  • 75. Step 4: Class-based disambiguation. Top-K or threshold. H1: 0.98, H2: 0.95, H3: 0.94
  • 76. Results for Top-K. [Chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for Top-1, Top-2, Top-5 and Top-10]
  • 77. Results for δ threshold. [Chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90 and δ >= 0.85]
  • 78. Conclusion
      • SERIMI is a complementary approach to instance matching tools based on direct matching.
      • SERIMI is recommended for heterogeneous data where there is no overlap between schemas.
      • It is recommended for multi-class disambiguation.
  • 79. THANK YOU! Samur Araujo, s.f.cardosodearaujo@tudelft.nl