HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching


We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause clickthroughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.

Published in: Technology, Business

  1. 1. HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching<br />Arnab Nandi (Univ of Michigan), Phil Bernstein (Microsoft Research)<br />
  2. 2. Scenario<br />Arnab Nandi & Phil Bernstein<br />2<br />
  3. 3. Scenario<br />Search over structured data: commerce, entertainment<br />Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.<br />
  4. 4. Scenario<br />“Amazon.com”<br />3rd Party Feeds<br />query<br />Users<br />Search engine + data warehouse<br />results<br />High precision (irrespective of recall)<br />Minimal human involvement<br />
  6. 6. Example Feed<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br />&lt;Movie&gt;<br /> &lt;Title Key=&quot;Yes&quot;&gt;Indiana Jones and The Kingdom of The Crystal Skull&lt;/Title&gt;<br /> &lt;Release Key=&quot;Yes&quot;&gt;2008&lt;/Release&gt;<br /> &lt;Description&gt;Ever…&lt;/Description&gt;<br /> &lt;RunTime&gt;127&lt;/RunTime&gt;<br /> &lt;Categories&gt;<br /> &lt;Category&gt;Action&lt;/Category&gt;<br /> &lt;Category&gt;Comedy&lt;/Category&gt;<br /> &lt;/Categories&gt;<br /> &lt;MPAA&gt;PG-13&lt;/MPAA&gt;<br /> &lt;SiteUrl&gt;http://www.indianajones.com/site/index.html&lt;/SiteUrl&gt;<br /> &lt;Persons&gt;<br /> &lt;Person Role=&quot;Actor&quot; Character=&quot;Indiana Jones&quot;&gt;Harrison Ford&lt;/Person&gt;<br /> &lt;/Persons&gt;<br />&lt;/Movie&gt;<br />&lt;MOVIE&gt;<br /> &lt;MOVIE_ID&gt;57590&lt;/MOVIE_ID&gt;<br /> &lt;MOVIE_NAME&gt;Indiana Jones and the Kingdom of the Crystal Skull&lt;/MOVIE_NAME&gt;<br /> &lt;RUNTIME&gt;02:00&lt;/RUNTIME&gt;<br /> &lt;GENRE1&gt;Action/Adventure&lt;/GENRE1&gt;<br /> &lt;GENRE2/&gt;<br /> &lt;MPAA&gt;NR&lt;/MPAA&gt;<br /> &lt;ADVISORY/&gt;<br /> &lt;URL&gt;http://www.indianajones.com/&lt;/URL&gt;<br /> &lt;ACTOR1&gt;Harrison Ford&lt;/ACTOR1&gt;<br /> &lt;ACTOR2&gt;Karen Allen&lt;/ACTOR2&gt;<br />&lt;/MOVIE&gt;<br />
  7. 7. Schema Matching<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br />-&lt;Movie&gt;<br /> &lt;Title Key=&quot;Yes&quot;&gt;Indiana Jones and The Kingdom of The Crystal Skull&lt;/Title&gt;<br /> &lt;Release Key=&quot;Yes&quot;&gt;2008&lt;/Release&gt;<br /> &lt;Description&gt;Ever…&lt;/Description&gt;<br /> &lt;RunTime&gt;127&lt;/RunTime&gt;<br />&lt;Categories&gt;<br /> &lt;Category&gt;Action&lt;/Category&gt;<br /> &lt;Category&gt;Comedy&lt;/Category&gt;<br /> &lt;/Categories&gt;<br /> &lt;MPAA&gt;PG-13&lt;/MPAA&gt;<br /> &lt;SiteUrl&gt;http://www.indianajones.com/site/index.html&lt;/SiteUrl&gt;<br />-&lt;Persons&gt;<br /> &lt;Person Role=&quot;Actor&quot; Character=&quot;Indiana Jones&quot;&gt;Harrison Ford&lt;/Person&gt;<br />-&lt;/Persons&gt;<br /> &lt;/Movie&gt;<br />&lt;MOVIE&gt;<br /> &lt;MOVIE_ID&gt;57590&lt;/MOVIE_ID&gt;<br /> &lt;MOVIE_NAME&gt;Indiana Jones and the Kingdom of the Crystal Skull&lt;/MOVIE_NAME&gt;<br /> &lt;RUNTIME&gt;02:00&lt;/RUNTIME&gt;<br /> &lt;GENRE1&gt;Action/Adventure&lt;/GENRE1&gt;<br /> &lt;GENRE2/&gt;<br /> &lt;RATING&gt;NR&lt;/RATING&gt;<br /> &lt;ADVISORY/&gt;<br /> &lt;URL&gt;http://www.indianajones.com/&lt;/URL&gt;<br /> &lt;ACTOR1&gt;Harrison Ford&lt;/ACTOR1&gt;<br /> &lt;ACTOR2&gt;Karen Allen&lt;/ACTOR2&gt;<br />&lt;/MOVIE&gt;<br />6<br />Arnab Nandi & Phil Bernstein<br />
  8. 8. Taxonomy Matching<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br />-&lt;Movie&gt;<br /> &lt;Title Key=&quot;Yes&quot;&gt;Indiana Jones and The Kingdom of The Crystal Skull&lt;/Title&gt;<br /> &lt;Release Key=&quot;Yes&quot;&gt;2008&lt;/Release&gt;<br /> &lt;Description&gt;Ever…&lt;/Description&gt;<br /> &lt;RunTime&gt;127&lt;/RunTime&gt;<br />&lt;Categories&gt;<br /> &lt;Category&gt;Action&lt;/Category&gt;<br /> &lt;Category&gt;Comedy&lt;/Category&gt;<br /> &lt;/Categories&gt;<br /> &lt;MPAA&gt;PG-13&lt;/MPAA&gt;<br /> &lt;SiteUrl&gt;http://www.indianajones.com/site/index.html&lt;/SiteUrl&gt;<br />-&lt;Persons&gt;<br /> &lt;Person Role=&quot;Actor&quot; Character=&quot;Indiana Jones&quot;&gt;Harrison Ford&lt;/Person&gt;<br />-&lt;/Persons&gt;<br /> &lt;/Movie&gt;<br />&lt;MOVIE&gt;<br /> &lt;MOVIE_ID&gt;57590&lt;/MOVIE_ID&gt;<br /> &lt;MOVIE_NAME&gt;Indiana Jones and the Kingdom of the Crystal Skull&lt;/MOVIE_NAME&gt;<br /> &lt;RUNTIME&gt;02:00&lt;/RUNTIME&gt;<br /> &lt;GENRE1&gt;Action/Adventure&lt;/GENRE1&gt;<br /> &lt;GENRE2/&gt;<br /> &lt;RATING&gt;NR&lt;/RATING&gt;<br /> &lt;ADVISORY/&gt;<br /> &lt;URL&gt;http://www.indianajones.com/&lt;/URL&gt;<br /> &lt;ACTOR1&gt;Harrison Ford&lt;/ACTOR1&gt;<br /> &lt;ACTOR2&gt;Karen Allen&lt;/ACTOR2&gt;<br />&lt;/MOVIE&gt;<br />7<br />Arnab Nandi & Phil Bernstein<br />
  9. 9. Various Problems<br />Badly normalized…<br />Unit conversion…<br />In-band signaling…<br />Arbitrary labels<br />Zero documentation<br />Not enough instances<br />Formatting choices…<br />Non-standard vocabulary / language<br />
  10. 10. Unlike conventional matching…<br />We have web search click data for both the warehouse and the 3rd party website<br />The databases we are integrating (usually) have a presence on the web<br />Why not use click data as a feature for schema & taxonomy matching?<br />
  11. 11. Outline<br />Scenario<br />Using Clicklogs<br />Core idea<br />Using Query Distributions<br />Example<br />System Architecture<br />Results<br />
  12. 12. Core idea<br />“If two (sets of) products are searched for by similar queries, then they are similar”<br />Web Search<br />Small laptop<br />
  13. 13. Core idea<br />Warehouse<br />Asus.com<br />Clicklog<br />hardware<br />Small Laptops<br />Pro. Laptops<br />eee<br />X<br />Y<br />eee ::: small laptops<br />Small laptop<br />Z<br />
  14. 14. Query Distributions<br />click count<br />
  15. 15. Mapping to Taxonomy<br />Map URL to product, which belongs to taxonomy<br />http://www.amazon.com/dp/B001JTA59C<br />Shopping | Electronics | Netbooks<br />3rd party DB (provided to us)<br />
  16. 16. Aggregating Query Distributions<br />Warehouse<br />Asus.com<br />hardware<br />Small Laptops<br />Pro. Laptops<br />eee<br />eee ::: small laptops<br />
  17. 17. Aggregate URLs to categories<br />Aggregate queries for each URL to schema element / taxonomy term<br />Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”<br />Office Products | Office Machines | Netbooks: “netbook”<br />
  18. 18. Generating Correspondences<br />Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.<br />Process<br />1. For each page (URL): identify its query distribution and its category / schema element<br />2. For each category / schema element C: aggregate over pages in C to get a query distribution<br />3. For each foreign category / schema element: find the host category / schema element with the most similar query distribution<br />
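The three-step process on the Generating Correspondences slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the clicklog is assumed to be a list of `(query, url, click_count)` rows, the URL-to-category tables and all URLs are made up, and `overlap` is only a placeholder similarity (shared mass of identical queries), not the metric from the slides. The click counts are taken from the laptop example in the deck.

```python
from collections import Counter, defaultdict

def aggregate(clicks, url_to_category):
    """Steps 1-2: sum clicks per (category, query), then normalise to frequencies."""
    counts = defaultdict(Counter)
    for query, url, n in clicks:
        category = url_to_category.get(url)
        if category is not None:          # skip pages we cannot place in the taxonomy
            counts[category][query] += n
    dists = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        dists[category] = {q: n / total for q, n in counter.items()}
    return dists

def overlap(p, q):
    """Placeholder similarity (an assumption): shared frequency mass of identical queries."""
    return sum(min(p[x], q[x]) for x in p.keys() & q.keys())

def match(foreign_dists, host_dists, similarity=overlap):
    """Step 3: for each foreign category, pick the most similar host category."""
    result = {}
    for f_cat, f_dist in foreign_dists.items():
        result[f_cat] = max(host_dists, key=lambda h: similarity(host_dists[h], f_dist))
    return result

# Hypothetical clicklog rows; URLs are invented for illustration.
host_clicks = [
    ("laptop", "host/p1", 70), ("netbook", "host/p1", 5),
    ("laptop", "host/s1", 25), ("netbook", "host/s1", 20),
]
host_urls = {"host/p1": "Professional Laptops", "host/s1": "Small Laptops"}
foreign_clicks = [
    ("laptop", "asus/e1", 5), ("netbook", "asus/e1", 10),
    ("netbook", "asus/e2", 5), ("cheap laptop", "asus/e2", 5),
]
foreign_urls = {"asus/e1": "eee", "asus/e2": "eee"}

host_dists = aggregate(host_clicks, host_urls)
foreign_dists = aggregate(foreign_clicks, foreign_urls)
matches = match(foreign_dists, host_dists)
print(matches)
```

Passing the similarity function as a parameter keeps the aggregation and matching steps independent of whichever distribution metric is eventually chosen.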
  20. 20. Example: Taxonomy Matching<br />Warehouse: Professional Laptops<br />Warehouse: Small Laptops<br />eee<br />
  21. 21. Example: Taxonomy Matching<br />Warehouse: Professional Laptops<br />“laptop”: 70/75, “netbook”: 5/75<br />Warehouse: Small Laptops<br />“laptop”: 25/45, “netbook”: 20/45<br />eee<br />“laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25<br />
  22. 22. Distribution Similarity Metric<br />$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$, summed over all $q_{host}$, $q_{foreign}$ combinations<br />
  23. 23. Example: Taxonomy Matching<br />“small laptops” vs “eee”<br />laptop vs laptop, netbook vs netbook, laptop vs cheap laptop<br />1 x (5/25) + 1 x (20/45) + 0.5 x (5/25) = 0.74<br />Warehouse: Professional Laptops<br />“laptop”: 70/75, “netbook”: 5/75<br />0.31<br />Warehouse: Small Laptops<br />“laptop”: 25/45, “netbook”: 20/45<br />0.74<br />eee<br />“laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25<br />
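The similarity computation in this example can be checked with a short Python sketch. It rests on two assumptions that the slides do not state outright: Jaccard is computed over the sets of words in each query, and MinFreq is the smaller of the two queries' relative frequencies. Under that reading the Small Laptops score reproduces the slide's 0.74. The Professional Laptops score comes out at about 0.37 rather than the slide's 0.31, so the authors' exact computation may differ in some detail; the ranking is the same either way.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two query strings (assumed tokenisation)."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def similarity(host, foreign):
    """Sum of Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(qh, qf) * min(fh, ff)      # MinFreq assumed to be the smaller frequency
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )

# Query distributions from the slide's example.
professional = {"laptop": 70 / 75, "netbook": 5 / 75}
small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score_small = similarity(small, eee)
score_professional = similarity(professional, eee)
print(round(score_small, 2), round(score_professional, 2))
```

Only three query pairs contribute to each score, because every other pair shares no words and so has a Jaccard value of zero.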
  24. 24. Advantages of Clicklogs<br />Resilient to language<br />Resilient to new domains, data, and features<br />As long as people query & click, we have data to learn from<br />Generates mappings previous methods can’t:<br />Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators<br />Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools<br />
  25. 25. System Design<br />
  27. 27. Experimenting with Click Logs<br />Commercial warehouse mapping, 258 products<br />from a 70,000 term Amazon.com taxonomy (613 in gold)<br />to a 6,000 term warehouse taxonomy (40 in gold)<br />Live.com (now Bing.com) search querylog<br />Amazon to warehouse mapping task, consecutively halving the clicklog size used<br />1.8 million clicks to Amazon.com product pages<br />Each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).<br />
  28. 28. Summary of Results<br />90% precision / recall possible<br />Query distribution is a good similarity metric<br />Bigger clicklogs imply better recall<br />Technique isn't very sensitive to similarity metric<br />
  29. 29. Precision / Recall<br />Commercial warehouse mapping, 258 products<br />from a 70K term Amazon.com taxonomy<br />to a 6,000 term warehouse taxonomy (613 categories used)<br />
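For readers who want to reproduce this kind of evaluation, precision and recall against a gold mapping could be computed as in the sketch below. The slides do not spell out the evaluation protocol, so this assumes the standard definitions: precision is the fraction of proposed mappings that agree with the gold mapping, and recall is the fraction of gold mappings that were recovered. The category names are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted foreign->host mappings against gold."""
    correct = sum(1 for foreign, host in predicted.items() if gold.get(foreign) == host)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold standard and matcher output.
gold = {"eee": "Small Laptops", "ti": "Calculators", "vb": "Developer Tools", "x": "Y"}
predicted = {"eee": "Small Laptops", "ti": "Calculators", "vb": "Games"}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A matcher that declines to propose low-confidence mappings trades recall for precision, which fits the stated requirement of high precision irrespective of recall.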
  31. 31. Match Quality<br />QDs (query distributions) of entities are closest to the distributions of their aggregate classes<br />QDs of similar aggregates are similar<br />QDs are unique to entities<br />QDs are unique to aggregate classes<br />
  34. 34. Varying Clicklog Size<br />Successively decreased clicklog size by half<br />Recall decreases as clicklog size is decreased<br />
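One way to set up such a halving experiment is sketched below. The slides do not say how the smaller clicklogs were drawn, so uniform random sampling without replacement is an assumption here, as are the function name and the seed parameter.

```python
import random

def halved_samples(clicks, rounds, seed=0):
    """Return the full clicklog followed by `rounds` successive random halvings."""
    rng = random.Random(seed)              # fixed seed so runs are repeatable
    current = list(clicks)
    samples = [current]
    for _ in range(rounds):
        current = rng.sample(current, len(current) // 2)
        samples.append(current)
    return samples

# Stand-in clicklog: 16 dummy rows.
samples = halved_samples(range(16), rounds=3)
sizes = [len(s) for s in samples]
print(sizes)
```

Each sample is drawn from the previous one, so every smaller clicklog is a subset of the larger ones and the recall numbers are comparable across sizes.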
  36. 36. Comparing Query Distributions<br />$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$, summed over all $q_{host}$, $q_{foreign}$ combinations<br />Replace Jaccard with various phrase similarity metrics<br />Minimal difference due to size of most queries<br />
  38. 38. Summary of Results<br />90% precision / recall possible<br />Query distribution is a good similarity metric<br />Bigger clicklogs imply better recall<br />Technique isn't very sensitive to similarity metric<br />
  40. 40. Related + Future Work<br />Usage Based / Crowdsourcing<br />Usage-Based Schema Matching (ICDE 2008). H Elmeleegy, M Ouzzani, A Elmagarmid<br />Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan<br />Web Scale Integration<br />Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy<br />
  41. 41. Related + Future Work<br />“Mixed” methods<br />Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy<br />Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy<br />Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm<br />
  42. 42. Conclusion<br />Unsupervised mapping is possible<br />Very high recall / precision when enough queries are present<br />Click logs are promising<br />Finds results that other methods cannot find<br />As clicklog size increases, it will produce more mappings<br />Combinable with existing methods<br />
  43. 43. Questions?<br />http://arnab.org/contact<br />http://research.microsoft.com/~philbe/<br />
  44. 44. Existing Methods<br />A Survey of Approaches to Automatic Schema Matching (VLDBJ 2001). Erhard Rahm, Philip A. Bernstein<br />
  45. 45. Name-based & Instance-based<br />Not ideal for our use case<br />Need high precision<br />“Task B”: Commercial warehouse mapping, 258 products in a 70K term taxonomy to a 6,000 term taxonomy<br />
