
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

  1. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data. Samur Araujo, Duc Thanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe. Delft University of Technology. WebDB 2012.
  2. Me. You.
  3. Me: Apple.
  4. You.
  5. You: ? (Ambiguous)
  6. Me.
  7. Me.
  8. Me.
  9. You.
  10. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable.
  11. My Apple: Shape, Color, Eatable. Your Apple: Shape, Color, Eatable.
  12. My Apple: Spherical Shape, Red Color, Eatable. Your Apple: Round Shape, Green Color, Eatable. Fruit.
  13. My Apple. Your Apple.
  14. Instance Matching. Source; Target.
  15. “Instance matching uses a direct comparison paradigm.”
  16. [no text]
  17. “Is your Apple like my Apple?” “Hmm... Maybe!” Source; Target.
  18. Homogeneous data and schema.
  19. The source and target descriptions overlap. Source; Target.
  20. Syntactic Overlap: Population = TotalPopulation
  21. Semantic Overlap: Population = Num_Inhabitants
  22. Web of Data: heterogeneous data and schema.
  23. No or limited overlap between schemas. Source; Target.
  24. Instances do not instantiate the schema properly. Source; Target.
  25. Apple: Nutritional Information; Botanical Information.
  26. “Direct comparison paradigm does not apply.” Source; Target.
  27. Me: Apple.
  28. Me: Apple, Orange, Pineapple.
  29. Me; You. Apple, Orange, Pineapple.
  30. You.
  31. Food.
  32. [no text]
  33. Eatable. Food.
  34. Source.
  35. My Apple, Your Apple. My Orange, Your Orange. My Pineapple, Your Pineapple.
  36. [no text]
  37. “We use a class-based disambiguation paradigm …”
  38. “We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.”
  39. Instance Matching with SERIMI. Source; Target.
  40. Instance Representation.
  41. Instance Representation. Columns: Predicate, Instance, Value.
  42. Instance Representation. (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable).
  43. Instance Representation. (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable).
  44. Instance Representation. (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable).
  45. Instance Representation: [P(hi), D(hi), O(hi), T(hi)]
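
The instance representation on these slides (predicate, instance, value rows, then a feature vector [P, D, O, T]) can be sketched in a few lines. The slides do not define the four letters, so the reading used here is an assumption: P as the set of predicates, D as literal values, O as values that are references to other resources, and T as predicate-value pairs. The names `Triple`, `is_reference` and `features` are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    predicate: str
    instance: str
    value: str


# The Apple1 rows shown on the "Instance Representation" slides.
APPLE1 = [
    Triple("shape", "Apple1", "Round"),
    Triple("title", "Apple1", "Apple"),
    Triple("color", "Apple1", "Red"),
    Triple("category", "Apple1", "Eatable"),
]


def is_reference(value: str) -> bool:
    # Assumption: a value is a reference to another resource if it looks like a URI.
    return value.startswith(("http://", "https://"))


def features(triples: list[Triple]):
    """Return (P, D, O, T) under the assumed reading described above."""
    P = {t.predicate for t in triples}
    D = {t.value for t in triples if not is_reference(t.value)}
    O = {t.value for t in triples if is_reference(t.value)}
    T = {(t.predicate, t.value) for t in triples}
    return P, D, O, T


P, D, O, T = features(APPLE1)
```

Keeping the four feature sets separate makes it possible to compare instances on predicates alone, on values alone, or on both, which matters when source and target schemas share little vocabulary.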
  46. Step 1: Cluster the source. Source clusters: Cars, Fruits, Companies.
  47. Step 2: Blocking Key Selection. Source instances; Key Selection.
  48. Step 2: Blocking Key Selection. Source instances; Key Selection; key, key, key.
  49. Step 2: Blocking Key Selection. Source instances; Key Selection; key (e.g. Title), key, key.
  50. Step 3: Pseudo-Homonyms Builder. Keys: Title=apple, Title=orange, Title=pineapple; Pseudo-Homonyms Builder; Target.
  51. Step 3: Pseudo-Homonyms Builder. Source instances; Pseudo-Homonyms Builder; Target; Target pseudo-homonyms sets (“Everything called Apple”).
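
Steps 2 and 3 can be illustrated with a small sketch: pick a blocking key from the source instances, then collect, for each source instance, every target instance whose values contain the same label ("everything called Apple"). The key-selection heuristic (highest coverage, then most distinct values) is an assumption, not something the slides specify, and all names and data below are hypothetical.

```python
from collections import defaultdict

Instance = dict[str, str]  # predicate -> literal value


def select_key(source: dict[str, Instance]) -> str:
    """Assumed heuristic: prefer the predicate used by most instances,
    breaking ties by the number of distinct values it takes."""
    values_by_predicate = defaultdict(list)
    for description in source.values():
        for predicate, value in description.items():
            values_by_predicate[predicate].append(value)
    return max(
        sorted(values_by_predicate),
        key=lambda p: (len(values_by_predicate[p]), len(set(values_by_predicate[p]))),
    )


def build_pseudo_homonyms(source, target, key):
    """For each source instance, list the target instances sharing its key value."""
    sets = {}
    for source_id, description in source.items():
        label = description.get(key)
        if label is None:
            continue  # no key value, so no candidates can be retrieved
        wanted = label.casefold()
        sets[source_id] = sorted(
            target_id
            for target_id, candidate in target.items()
            if any(value.casefold() == wanted for value in candidate.values())
        )
    return sets


source = {
    "s_apple": {"title": "Apple", "category": "Fruit"},
    "s_orange": {"title": "Orange", "category": "Fruit"},
    "s_pineapple": {"title": "Pineapple", "category": "Fruit"},
}
target = {
    "t1": {"name": "Apple", "type": "Fruit"},
    "t2": {"label": "apple", "type": "Company"},
    "t3": {"name": "Orange", "type": "Fruit"},
    "t4": {"label": "Orange", "type": "Company"},
    "t5": {"name": "Pineapple", "type": "Fruit"},
}

key = select_key(source)
pseudo_homonyms = build_pseudo_homonyms(source, target, key)
```

Note that the lookup ignores the target's predicate names entirely, so it still works when the two schemas have no predicates in common. The resulting sets are deliberately over-inclusive; separating the fruit from the company is left to the disambiguation step.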
  52. Step 4: Class-based disambiguation. Target pseudo-homonyms sets; Class-based Disambiguator.
  53. Step 4: Class-based disambiguation. Source instances; Target pseudo-homonyms sets; Class-based Disambiguator; Target pseudo-homonyms sets.
  54. Step 4: Class-based disambiguation.
  55. Step 4: Class-based disambiguation.
  56. Step 4: Class-based disambiguation. Pseudo-homonym sets of instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}.
  57. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. Pseudo-homonym sets of instances: H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}.
  58. Instance Representation. (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable).
  59. Step 4: Class-based disambiguation. Pseudo-homonym sets with scores. H1: h11 = 0.98, h12 = 0.32, h13 = 0.32, h14 = 0.76. H2: h21 = 0.95, h22 = 0.53. H3: h31 = 0.94, h32 = 0.91, h33 = 0.87.
  60. Step 4: Class-based disambiguation. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}.
  61. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. H1 = {h11, h12, h13, h14}, H2 = {h21, h22}, H3 = {h31, h32, h33}.
  62. Step 4: Class-based disambiguation.
  63. Step 4: Class-based disambiguation. The sets H1, H2, H3 are shown twice; in the second copy h11 is replaced by its score, 0.98.
  64. Step 4: Class-based disambiguation. Scores. H1: 0.98, 0.32, 0.32, 0.76. H2: 0.95, 0.53. H3: 0.94, 0.91, 0.87.
  65. Step 4: Class-based disambiguation.
  66. Step 4: Class-based disambiguation. Top-K or Threshold. H1: 0.98. H2: 0.95. H3: 0.94.
  67. Step 4: Class-based disambiguation.
  68. Step 4: Class-based disambiguation. Source instances; Target pseudo-homonyms sets; Class-based Disambiguator.
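
The scoring and selection shown on the Step 4 slides can be approximated as follows. This is a minimal sketch, not SERIMI's actual similarity function: it swaps in plain Jaccard similarity over feature sets and scores each candidate by how closely it resembles the best-matching candidate in every other pseudo-homonym set, on the assumption that the correct candidates all belong to one class. The function names and the toy data are hypothetical.

```python
from statistics import mean


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_candidates(pseudo_sets):
    """Score every candidate by its mean best similarity to the other sets."""
    scores = {}
    for name, candidates in pseudo_sets.items():
        others = [c for other, c in pseudo_sets.items() if other != name]
        scores[name] = {
            cid: mean(
                max((jaccard(feats, g) for g in other.values()), default=0.0)
                for other in others
            )
            if others
            else 0.0
            for cid, feats in candidates.items()
        }
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring candidates of each pseudo-homonym set."""
    return {
        name: sorted(cands, key=cands.get, reverse=True)[:k]
        for name, cands in scores.items()
    }


def select_threshold(scores, delta):
    """Keep every candidate whose score is at least delta."""
    return {
        name: sorted(cid for cid, s in cands.items() if s >= delta)
        for name, cands in scores.items()
    }


# Toy pseudo-homonym sets: each candidate is described by a feature set.
pseudo_sets = {
    "apple": {
        "t1": {"type=Fruit", "edible=yes"},
        "t2": {"type=Company", "sector=Tech"},
    },
    "orange": {
        "t3": {"type=Fruit", "edible=yes"},
        "t4": {"type=Company", "sector=Telecom"},
    },
    "pineapple": {"t5": {"type=Fruit", "edible=yes"}},
}

scores = score_candidates(pseudo_sets)
top1 = select_top_k(scores, 1)
above = select_threshold(scores, 0.9)
```

In this toy data the fruit candidates reinforce one another across sets, while the two companies find support in only one other set, so both Top-1 and a 0.9 threshold keep the fruits. Top-K always returns something for every source instance; a threshold can return nothing, which may be preferable when a source instance has no true match in the target.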
  69. Experiment
      • Ontology Alignment Evaluation Initiative (OAEI 2010)
      • Collections: the life science (LS) collection (DBPedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection
      • 20 gigabytes of data, millions of triples
      • We compared SERIMI to ObjectCoref and RiMOM
      • Measures: Precision, Recall and F1
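
For reference, the three measures named on this slide can be computed from a set of predicted matches and a reference alignment as below. The pair identifiers are made up for illustration and are not taken from the OAEI data.

```python
def precision_recall_f1(predicted: set, reference: set):
    """Standard precision, recall and F1 over sets of (source, target) pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical matcher output and reference alignment.
predicted = {("s1", "t1"), ("s2", "t3"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t3"), ("s3", "t5"), ("s4", "t7")}

precision, recall, f1 = precision_recall_f1(predicted, reference)
```

Here two of the three predicted pairs are correct and two of the four reference pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.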
  70. Results.
  71. Results.
  72. Results.
  73. [no text]
  74. Results.
  75. Step 4: Class-based disambiguation. Top-K or Threshold. H1: 0.98. H2: 0.95. H3: 0.94.
  76. Results for Top-K. [Chart: F1 (axis 0.00 to 1.00) by dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12); series Top-1, Top-2, Top-5, Top-10.]
  77. Results for δ threshold. [Chart: F1 (axis 0.00 to 1.00) by dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12); series δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, δ >= 0.85.]
  78. Conclusion
      • SERIMI is a complementary approach to direct-match-based instance matching tools.
      • SERIMI is recommended for heterogeneous data where there is no overlap between schemas.
      • It is recommended for multi-class disambiguation.
  79. THANK YOU! Samur Araujo, s.f.cardosodearaujo@tudelft.nl
