
Schema matching for merging data feeds



  1. 1. HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching<br />Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)<br />
  2. 2. Scenario<br />Arnab Nandi & Phil Bernstein<br />
  3. 3. Scenario<br />Search over structured data<br />Commerce<br />Entertainment<br />Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.<br />
  4. 4. Scenario<br />[Diagram: Users send a query to the Search engine + data warehouse and receive results; 3rd Party Feeds (e.g. “Amazon.com”) feed the warehouse]<br /><ul><li>High Precision</li><li>High Recall</li><li>Minimal Human Involvement</li></ul>
  6. 6. Example Feed<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br /><Movie><br /> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title><br /> <Release Key="Yes">2008</Release><br /> <Description>Ever…</Description><br /> <RunTime>127</RunTime><br /> <Categories><br /> <Category>Action</Category><br /> <Category>Comedy</Category><br /> </Categories><br /> <MPAA>PG-13</MPAA><br /> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl><br /> <Persons><br /> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person><br /> </Persons><br /></Movie><br /><MOVIE><br /> <MOVIE_ID>57590</MOVIE_ID><br /> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME><br /> <RUNTIME>02:00</RUNTIME><br /> <GENRE1>Action/Adventure</GENRE1><br /> <GENRE2/><br /> <MPAA>NR</MPAA><br /> <ADVISORY/><br /> <URL>http://www.indianajones.com/</URL><br /> <ACTOR1>Harrison Ford</ACTOR1><br /> <ACTOR2>Karen Allen</ACTOR2><br /></MOVIE><br />
  7. 7. Schema Matching<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br /><Movie><br /> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title><br /> <Release Key="Yes">2008</Release><br /> <Description>Ever…</Description><br /> <RunTime>127</RunTime><br /> <Categories><br /> <Category>Action</Category><br /> <Category>Comedy</Category><br /> </Categories><br /> <MPAA>PG-13</MPAA><br /> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl><br /> <Persons><br /> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person><br /> </Persons><br /></Movie><br /><MOVIE><br /> <MOVIE_ID>57590</MOVIE_ID><br /> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME><br /> <RUNTIME>02:00</RUNTIME><br /> <GENRE1>Action/Adventure</GENRE1><br /> <GENRE2/><br /> <RATING>NR</RATING><br /> <ADVISORY/><br /> <URL>http://www.indianajones.com/</URL><br /> <ACTOR1>Harrison Ford</ACTOR1><br /> <ACTOR2>Karen Allen</ACTOR2><br /></MOVIE><br />
  8. 8. Taxonomy Matching<br />3rd Party Movie Site (Foreign)<br />Warehouse: Movies (Host)<br /><Movie><br /> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title><br /> <Release Key="Yes">2008</Release><br /> <Description>Ever…</Description><br /> <RunTime>127</RunTime><br /> <Categories><br /> <Category>Action</Category><br /> <Category>Comedy</Category><br /> </Categories><br /> <MPAA>PG-13</MPAA><br /> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl><br /> <Persons><br /> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person><br /> </Persons><br /></Movie><br /><MOVIE><br /> <MOVIE_ID>57590</MOVIE_ID><br /> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME><br /> <RUNTIME>02:00</RUNTIME><br /> <GENRE1>Action/Adventure</GENRE1><br /> <GENRE2/><br /> <RATING>NR</RATING><br /> <ADVISORY/><br /> <URL>http://www.indianajones.com/</URL><br /> <ACTOR1>Harrison Ford</ACTOR1><br /> <ACTOR2>Karen Allen</ACTOR2><br /></MOVIE><br />
  9. 9. Various Problems<br />Badly normalized…<br />Unit conversion…<br />In-band signaling…<br />Arbitrary labels<br />Zero documentation<br />Not enough instances<br />Formatting choices…<br />Non-standard vocabulary / language<br />
  10. 10. Unlike conventional matching…<br />[Diagram: Users, query, results, Search engine + data warehouse, 3rd Party Feed]<br />We have web search click data<br />For both the Warehouse & the 3rd party website<br />The databases we are integrating (usually) have a presence on the web<br />Why not use click data as a feature for schema & taxonomy matching?<br />
  11. 11. Outline<br />Scenario<br />Using Clicklogs<br />Core idea<br />Using Query Distributions<br />Example<br />System Architecture<br />Results<br />
  12. 12. Core idea<br />“If two (sets of) products are searched for by similar queries, then they are similar”<br />[Example web search query: “small laptop”]<br />
  13. 13. Core idea<br />[Diagram: Warehouse and Asus.com taxonomies (hardware, Small Laptops, Pro. Laptops, eee) linked through the Clicklog by the query “small laptop” on pages X, Y, Z]<br />eee ::: small laptops<br />
  14. 14. Query Distributions<br />[Chart: click count per query]<br />
  15. 15. Mapping to Taxonomy<br />Map URL to product, which belongs to taxonomy<br />http://www.amazon.com/dp/B001JTA59C<br />Shopping | Electronics | Netbooks<br />3rd party DB (provided to us)<br />
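The URL-to-taxonomy lookup on this slide can be sketched as follows. This is only an illustration: it assumes product URLs carry the product id after `/dp/`, and `PRODUCT_TAXONOMY` is a hypothetical in-memory stand-in for the 3rd party DB. The function names are not from the original system.

```python
import re
from typing import Optional

# Hypothetical stand-in for the 3rd party product DB: product id -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": "Shopping | Electronics | Netbooks",
}

# Assumption: the product id is the path segment that follows "/dp/".
_DP_PATTERN = re.compile(r"/dp/([A-Za-z0-9]+)")


def product_id_from_url(url: str) -> Optional[str]:
    """Extract the product id from a clicked URL, or None if there is none."""
    match = _DP_PATTERN.search(url)
    return match.group(1) if match else None


def taxonomy_for_url(url: str) -> Optional[str]:
    """Map a clicked URL to the taxonomy term of its product, if known."""
    product_id = product_id_from_url(url)
    if product_id is None:
        return None
    return PRODUCT_TAXONOMY.get(product_id)
```

Clicks on URLs that do not resolve to a known product return `None` and would simply be ignored by later steps.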
  16. 16. Aggregating Query Distributions<br />[Diagram: query distributions aggregated over the Warehouse and Asus.com taxonomies (hardware, Small Laptops, Pro. Laptops, eee)]<br />eee ::: small laptops<br />
  17. 17. Aggregate URLs to categories<br />Aggregate queries for each URL to schema element / taxonomy term<br />Electronics|ElectronicsFeatures|Brands|Asus EEE: “netbook”, “laptop”, “cheap laptop”<br />Office Products|OfficeMachines|Netbooks: “netbook”<br />
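A minimal sketch of this aggregation step, assuming the clicklog is available as (query, URL) pairs and that each URL has already been mapped to a category. The page names below are hypothetical; only the two category strings come from the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def aggregate_by_category(
    clicks: Iterable[Tuple[str, str]],
    url_to_category: Mapping[str, str],
) -> Dict[str, Counter]:
    """Sum per-URL query clicks into one query counter per category."""
    per_category: Dict[str, Counter] = defaultdict(Counter)
    for query, url in clicks:
        category = url_to_category.get(url)
        if category is None:
            continue  # skip clicks on pages outside the catalog
        per_category[category][query] += 1
    return dict(per_category)


# Hypothetical page names; the category strings are the ones on the slide.
EEE = "Electronics|ElectronicsFeatures|Brands|Asus EEE"
NETBOOKS = "Office Products|OfficeMachines|Netbooks"
url_to_category = {"page-1": EEE, "page-2": EEE, "page-3": NETBOOKS}
clicks = [
    ("netbook", "page-1"),
    ("laptop", "page-1"),
    ("cheap laptop", "page-2"),
    ("netbook", "page-3"),
    ("netbook", "page-unknown"),
]
per_category = aggregate_by_category(clicks, url_to_category)
```

Each counter can then be normalized into a query distribution for its category.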
  18. 18. Generating Correspondences<br />Goal: To match two schema elements (or categories), determine whether the queries searching for them have the same distribution.<br />Process:<br />1. For each page (URL): identify its query distribution and the category / schema element of that page<br />2. For each category / schema element C: aggregate over pages in C to get its query distribution<br />3. For each foreign category / schema element: find the host category / schema element with the most similar query distribution<br />
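The last step of the process can be sketched as below. This is an illustration only: `shared_query_overlap` is a placeholder similarity chosen for brevity, not the metric used in the talk, and the click counts are the ones from the deck's taxonomy matching example.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Distribution = Dict[str, float]  # query -> relative click frequency


def normalize(counts: Counter) -> Distribution:
    """Turn raw click counts into relative frequencies."""
    total = sum(counts.values())
    return {query: n / total for query, n in counts.items()}


def shared_query_overlap(a: Distribution, b: Distribution) -> float:
    """Placeholder similarity: frequency mass shared by identical queries."""
    return sum(min(a[q], b[q]) for q in a.keys() & b.keys())


def best_host_matches(
    foreign: Dict[str, Distribution],
    host: Dict[str, Distribution],
    similarity: Callable[[Distribution, Distribution], float] = shared_query_overlap,
) -> Dict[str, Tuple[str, float]]:
    """For each foreign category, pick the host category with the highest score."""
    matches = {}
    for f_name, f_dist in foreign.items():
        scored = [(similarity(h_dist, f_dist), h_name) for h_name, h_dist in host.items()]
        score, h_name = max(scored)  # assumes at least one host category
        matches[f_name] = (h_name, score)
    return matches


host = {
    "Professional Laptops": normalize(Counter({"laptop": 70, "netbook": 5})),
    "Small Laptops": normalize(Counter({"laptop": 25, "netbook": 20})),
}
foreign = {"eee": normalize(Counter({"laptop": 5, "netbook": 15, "cheap laptop": 5}))}
matches = best_host_matches(foreign, host)
```

Passing the similarity as a parameter keeps the matching loop independent of whichever distribution metric is chosen.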
  20. 20. Example: Taxonomy Matching<br />Warehouse: Professional Laptops<br />Warehouse: Small Laptops<br />eee<br />
  21. 21. Example: Taxonomy Matching<br />Warehouse: Professional Laptops = {“laptop”: 70/75, “netbook”: 5/75}<br />Warehouse: Small Laptops = {“laptop”: 25/45, “netbook”: 20/45}<br />eee = {“laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25}<br />
  22. 22. Distribution Similarity Metric<br />$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$, summed over all $(q_{host}, q_{foreign})$ combinations<br />
  23. 23. Example: Taxonomy Matching<br />Warehouse: Professional Laptops = {“laptop”: 70/75, “netbook”: 5/75}, similarity to eee: 0.31<br />Warehouse: Small Laptops = {“laptop”: 25/45, “netbook”: 20/45}, similarity to eee: 0.74<br />eee = {“laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25}<br />“small laptops” vs “eee”: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop<br />1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) ≈ 0.74 (MinFreq takes the smaller of the two frequencies)<br />
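The worked example can be checked with a short sketch of the metric. It assumes Jaccard is computed over the words of each query and MinFreq is the smaller of the two relative click frequencies; the slides do not spell out either detail, but this reading reproduces the 0.74 for Small Laptops. For Professional Laptops it gives roughly 0.37 rather than the 0.31 on the slide, so the original may have used slightly different counts or a different treatment of partial matches.

```python
from typing import Dict

Distribution = Dict[str, float]  # query -> relative click frequency


def word_jaccard(q1: str, q2: str) -> float:
    """Jaccard similarity over the sets of words in two queries (an assumption)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def distribution_similarity(host: Distribution, foreign: Distribution) -> float:
    """Sum Jaccard x MinFreq over all (host query, foreign query) combinations."""
    return sum(
        word_jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )


professional = {"laptop": 70 / 75, "netbook": 5 / 75}
small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

small_score = distribution_similarity(small, eee)
professional_score = distribution_similarity(professional, eee)
```

Under these assumptions `small_score` exceeds `professional_score`, so eee is matched to Small Laptops, as on the slide.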
  24. 24. Advantages of Clicklogs<br />Resilient to language<br />Resilient to new domains, data, and features<br />As long as people query & click, we have data to learn from<br />Generates mappings previous methods can’t<br />Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators<br />Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools<br />
  25. 25. System Design<br />
  27. 27. Experimenting with Click Logs<br />Commercial warehouse mapping, 258 products<br />from a 70,000 term Amazon.com taxonomy (613 in gold)<br />to a 6,000 term warehouse taxonomy (40 in gold)<br />Live.com (now Bing.com) search querylog<br />Amazon to warehouse mapping task, successively halving the clicklog size used<br />1.8 million clicks to Amazon.com product pages<br />Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).<br />
  28. 28. Summary of Results<br />90% precision / recall possible<br />Query distribution is a good similarity metric<br />Bigger clicklogs imply better recall<br />Technique isn't very sensitive to similarity metric<br />
  29. 29. Precision / Recall<br />Commercial warehouse mapping, 258 products<br />from a 70K term Amazon.com taxonomy<br />to a 6,000 term warehouse taxonomy (613 categories used)<br />
  31. 31. Match Quality<br />Query distributions (QDs) of entities are closest to the distributions of their aggregate classes<br />QDs of similar aggregates are similar<br />QDs are unique to entities<br />QDs are unique to aggregate classes<br />
  34. 34. Varying Clicklog Size<br />Successively decreased clicklog size by half<br />Recall decreases as clicklog size is decreased<br />
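One way to set up such a halving experiment is sketched below. It assumes each smaller log is a uniform random sample of the previous one; the slides do not say how the halving was actually done, and the function name is illustrative.

```python
import random
from typing import List, Sequence, Tuple

Click = Tuple[str, str]  # (query, clicked URL)


def halving_samples(clicks: Sequence[Click], rounds: int, seed: int = 0) -> List[List[Click]]:
    """Return the full log followed by `rounds` successively halved random subsets."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    current = list(clicks)
    samples = [current]
    for _ in range(rounds):
        current = rng.sample(current, len(current) // 2)
        samples.append(current)
    return samples


# Synthetic clicklog used only to show the shrinking sizes.
clicks = [(f"query {i}", f"url {i}") for i in range(16)]
samples = halving_samples(clicks, rounds=3)
sizes = [len(sample) for sample in samples]
```

Running the matcher on each sample and recording recall would then give one point per clicklog size.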
  36. 36. Comparing Query Distributions<br />$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$, summed over all $(q_{host}, q_{foreign})$ combinations<br />Replace Jaccard with various phrase similarity metrics<br />Minimal difference due to size of most queries<br />
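Swapping the phrase similarity can be sketched by making it a parameter of the distribution metric. The slides do not name the alternatives that were tried, so the word-level Dice coefficient here is an illustrative choice, as is word-level tokenization.

```python
from typing import Callable, Dict

Distribution = Dict[str, float]  # query -> relative click frequency
PhraseSimilarity = Callable[[str, str], float]


def word_jaccard(q1: str, q2: str) -> float:
    """|A & B| / |A | B| over the words of two queries."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if (a or b) else 0.0


def word_dice(q1: str, q2: str) -> float:
    """2 |A & B| / (|A| + |B|) over the words of two queries."""
    a, b = set(q1.split()), set(q2.split())
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def distribution_similarity(
    host: Distribution,
    foreign: Distribution,
    phrase_similarity: PhraseSimilarity = word_jaccard,
) -> float:
    """Sum phrase_similarity x MinFreq over all query combinations."""
    return sum(
        phrase_similarity(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


professional = {"laptop": 70 / 75, "netbook": 5 / 75}
small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

rankings = {
    metric.__name__: distribution_similarity(small, eee, metric)
    > distribution_similarity(professional, eee, metric)
    for metric in (word_jaccard, word_dice)
}
```

With queries this short, most phrase metrics agree on which pairs overlap, which is consistent with the slide's observation that the choice makes little difference.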
  40. 40. Related + Future Work<br />Usage Based / Crowdsourcing<br />Usage-Based Schema Matching (ICDE 2008). H. Elmeleegy, M. Ouzzani, A. Elmagarmid<br />Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan<br />Web Scale Integration<br />Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy<br />
  41. 41. Related + Future Work<br />“Mixed” methods<br />Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy<br />Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy<br />Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm<br />
  42. 42. Conclusion<br />Unsupervised mapping is possible<br />Very high recall / precision when enough queries are present<br />Click logs are promising<br />They find results that other methods cannot find<br />As clicklog size increases, more mappings are produced<br />Combinable with existing methods<br />
  43. 43. Questions?<br />http://arnab.org/contact<br />http://research.microsoft.com/~philbe/<br />
  44. 44. Existing Methods<br />A Survey of Approaches to Automatic Schema Matching (VLDBJ 2001). Erhard Rahm, Philip A. Bernstein<br />
  45. 45. Name-based & Instance-based<br />Not ideal for our use case<br />Need high precision<br />“Task B”: Commercial warehouse mapping, 258 products in a 70K term taxonomy to a 6,000 term taxonomy<br />
