Schema matching for merging data feeds

Transcript

  • 1. HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
    Arnab Nandi Phil BernsteinUniv of Michigan Microsoft Research
  • 2. Scenario
    Arnab Nandi & Phil Bernstein
  • 3. Scenario
    Search over structured data
    Commerce
    Entertainment
    Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.
  • 4. Scenario
    [Diagram labels: “Amazon.com”, 3rd Party Feed (×4), query, Users, Search engine + data warehouse, results]
    • High Precision
    • High Recall
    • Minimal Human Involvement
  • Example Feed
    3rd Party Movie Site (Foreign)
    Warehouse: Movies (Host)
    <Movie>
      <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
      <Release Key="Yes">2008</Release>
      <Description>Ever…</Description>
      <RunTime>127</RunTime>
      <Categories>
        <Category>Action</Category>
        <Category>Comedy</Category>
      </Categories>
      <MPAA>PG-13</MPAA>
      <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
      <Persons>
        <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
      </Persons>
    </Movie>
    <MOVIE>
      <MOVIE_ID>57590</MOVIE_ID>
      <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
      <RUNTIME>02:00</RUNTIME>
      <GENRE1>Action/Adventure</GENRE1>
      <GENRE2/>
      <MPAA>NR</MPAA>
      <ADVISORY/>
      <URL>http://www.indianajones.com/</URL>
      <ACTOR1>Harrison Ford</ACTOR1>
      <ACTOR2>Karen Allen</ACTOR2>
    </MOVIE>
  • 7. Schema Matching
    3rd Party Movie Site (Foreign)
    Warehouse: Movies (Host)
    <Movie>
      <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
      <Release Key="Yes">2008</Release>
      <Description>Ever…</Description>
      <RunTime>127</RunTime>
      <Categories>
        <Category>Action</Category>
        <Category>Comedy</Category>
      </Categories>
      <MPAA>PG-13</MPAA>
      <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
      <Persons>
        <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
      </Persons>
    </Movie>
    <MOVIE>
      <MOVIE_ID>57590</MOVIE_ID>
      <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
      <RUNTIME>02:00</RUNTIME>
      <GENRE1>Action/Adventure</GENRE1>
      <GENRE2/>
      <RATING>NR</RATING>
      <ADVISORY/>
      <URL>http://www.indianajones.com/</URL>
      <ACTOR1>Harrison Ford</ACTOR1>
      <ACTOR2>Karen Allen</ACTOR2>
    </MOVIE>
  • 8. Taxonomy Matching
    3rd Party Movie Site (Foreign)
    Warehouse: Movies (Host)
    <Movie>
      <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
      <Release Key="Yes">2008</Release>
      <Description>Ever…</Description>
      <RunTime>127</RunTime>
      <Categories>
        <Category>Action</Category>
        <Category>Comedy</Category>
      </Categories>
      <MPAA>PG-13</MPAA>
      <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
      <Persons>
        <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
      </Persons>
    </Movie>
    <MOVIE>
      <MOVIE_ID>57590</MOVIE_ID>
      <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
      <RUNTIME>02:00</RUNTIME>
      <GENRE1>Action/Adventure</GENRE1>
      <GENRE2/>
      <RATING>NR</RATING>
      <ADVISORY/>
      <URL>http://www.indianajones.com/</URL>
      <ACTOR1>Harrison Ford</ACTOR1>
      <ACTOR2>Karen Allen</ACTOR2>
    </MOVIE>
  • 9. Various Problems
    Badly normalized…
    Unit conversion…
    In-band signaling…
    Arbitrary labels
    Zero documentation
    Not enough instances
    Formatting choices…
    Non standard vocabulary / language
  • 10. Unlike conventional matching…
    [Diagram labels: 3rd Party Feed, query, Users, Search engine + data warehouse, results]
    We have web search click data for both the warehouse and the 3rd-party website
    The databases we are integrating (usually) have a presence on the web
    Why not use click data as a feature for schema & taxonomy matching?
  • 11. Outline
    Scenario
    Using Clicklogs
    Core idea
    Using Query Distributions
    Example
    System Architecture
    Results
  • 12. Core idea
    “If two (sets of) products are searched for by similar queries, then they are similar”
    [Diagram labels: Web Search, “Small laptop”]
  • 13. Core idea
    [Diagram labels: Warehouse, Asus.com, Clicklog, hardware, Small Laptops, Pro. Laptops, eee, products X, Y, Z, query “Small laptop”]
    eee ::: small laptops
  • 14. Query Distributions
    [Figure: query distributions; axis label: click count]
  • 15. Mapping to Taxonomy
    Map URL to product, which belongs to taxonomy
    http://www.amazon.com/dp/B001JTA59C
    Shopping | Electronics |Netbooks
    3rd party DB
    (provided to us)
  • 16. Aggregating Query Distributions
    [Diagram labels: Warehouse, Asus.com, hardware, Small Laptops, Pro. Laptops, eee]
    eee ::: small laptops
  • 17. Aggregate URLs to categories
    Aggregate queries for each URL to schema element / taxonomy term
    Electronics|ElectronicsFeatures|Brands|Asus EEE
    “netbook”, “laptop”, “cheap laptop”
    Office Products|OfficeMachines|Netbooks
    “netbook”
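The aggregation step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the click log rows, the `shop.example` URLs, and the helper names `aggregate_queries` and `to_distribution` are all hypothetical, and only the two taxonomy terms and their queries are taken from the slide.

```python
from collections import Counter, defaultdict

# Hypothetical click log: one (query, clicked URL) pair per click.
CLICKS = [
    ("netbook", "http://shop.example/p/1"),
    ("laptop", "http://shop.example/p/1"),
    ("cheap laptop", "http://shop.example/p/2"),
    ("netbook", "http://shop.example/p/3"),
]

# Hypothetical URL -> taxonomy term lookup (assumed to come from the product database).
URL_TO_CATEGORY = {
    "http://shop.example/p/1": "Electronics|ElectronicsFeatures|Brands|Asus EEE",
    "http://shop.example/p/2": "Electronics|ElectronicsFeatures|Brands|Asus EEE",
    "http://shop.example/p/3": "Office Products|OfficeMachines|Netbooks",
}


def aggregate_queries(clicks, url_to_category):
    """Roll per-URL query clicks up to the taxonomy term of each URL."""
    counts = defaultdict(Counter)
    for query, url in clicks:
        category = url_to_category.get(url)
        if category is not None:  # skip pages we cannot place in the taxonomy
            counts[category][query] += 1
    return counts


def to_distribution(counter):
    """Normalise click counts into relative query frequencies."""
    total = sum(counter.values())
    return {query: n / total for query, n in counter.items()}


per_category = aggregate_queries(CLICKS, URL_TO_CATEGORY)
for category, counter in per_category.items():
    print(category, to_distribution(counter))
```

The same roll-up would apply to schema elements instead of taxonomy terms, given a lookup from URL to schema element.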
  • 18. Generating Correspondences
    Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
    Process
      For each page (URL)
        Identify query distribution
        Identify category / schema element of that page
      For each category / schema element C
        Aggregate over pages in C to get query distribution
      For each foreign category / schema element
        Find host category / schema element with most similar query distribution
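The last step of this process (pick, for each foreign element, the host element with the most similar query distribution) might look like the sketch below. The function names are hypothetical, and `histogram_overlap` is only a simple stand-in similarity chosen for brevity, not the metric defined later in the deck. The toy numbers are the ones used in the deck's laptop example.

```python
def histogram_overlap(p, q):
    """Stand-in similarity: probability mass shared by two query distributions."""
    return sum(min(p[k], q[k]) for k in p.keys() & q.keys())


def best_host_matches(foreign, host, similarity=histogram_overlap):
    """For each foreign element, return (best host element, score).

    Assumes `host` is non-empty; ties are broken by host name.
    """
    matches = {}
    for f_name, f_dist in foreign.items():
        scored = [(similarity(h_dist, f_dist), h_name) for h_name, h_dist in host.items()]
        score, h_name = max(scored)
        matches[f_name] = (h_name, score)
    return matches


HOST = {
    "Warehouse: Professional Laptops": {"laptop": 70 / 75, "netbook": 5 / 75},
    "Warehouse: Small Laptops": {"laptop": 25 / 45, "netbook": 20 / 45},
}
FOREIGN = {
    "eee": {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25},
}

matches = best_host_matches(FOREIGN, HOST)
print(matches)
```

Passing the similarity in as a parameter keeps the matching loop independent of the particular metric.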
  • 19. Outline
    Scenario
    Using Clicklogs
    Core idea
    Using Query Distributions
    Example
    System Architecture
    Results
  • 20. Example: Taxonomy Matching
    Warehouse: Professional Laptops
    Warehouse: Small Laptops
    eee
  • 21. Example: Taxonomy Matching
    Warehouse: Professional Laptops
    “laptop”: 70/75, “netbook”: 5/75
    Warehouse: Small Laptops
    “laptop”: 25/45, “netbook”: 20/45
    eee
    “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
  • 22. Distribution Similarity Metric
    $\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$
    (sum over all $q_{host}$, $q_{foreign}$ combinations)
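One possible reading of this metric is sketched below. It assumes that Jaccard is computed over the word sets of the two queries and that MinFreq is the smaller of the two queries' relative frequencies; both are assumptions, since the slide does not spell out either definition. The function names are hypothetical, and the distributions are the ones from the deck's laptop example.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two query strings (assumed definition)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def distribution_similarity(host: Dict[str, float], foreign: Dict[str, float]) -> float:
    """Sum Jaccard x MinFreq over all (host query, foreign query) combinations."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)  # MinFreq assumed to be min()
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )


professional = {"laptop": 70 / 75, "netbook": 5 / 75}
small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(distribution_similarity(small, eee), 2))
print(round(distribution_similarity(professional, eee), 2))
```

Under these assumptions the Small Laptops score rounds to the 0.74 quoted in the deck. The Professional Laptops score comes out lower than that, as in the deck, but not at exactly the quoted 0.31, so the authors' exact definition may differ in some detail.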
  • 23. Example: Taxonomy Matching
    “small laptops” vs “eee”
    laptop vs laptop, netbook vs netbook, laptop vs cheap laptop
    1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
    Warehouse: Professional Laptops
    “laptop”: 70/75, “netbook”: 5/75
    Similarity to eee: 0.31
    Warehouse: Small Laptops
    “laptop”: 25/45, “netbook”: 20/45
    Similarity to eee: 0.74
    eee
    “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
  • 24. Advantages of Clicklogs
    Resilient to language
    Resilient to new domains, data, and features
    As long as people query & click, we have data to learn from
    Generates mappings previous methods can’t
    Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
    Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
  • 25. System Design
  • 26. Outline
    Scenario
    Using Clicklogs
    Core idea
    Using Query Distributions
    Example
    System Architecture
    Results
  • 27. Experimenting with Click Logs
    Commercial warehouse mapping, 258 products
    from a 70,000 term Amazon.com taxonomy (613 in gold)
    to a 6,000 term warehouse taxonomy (40 in gold)
    Live.com (now Bing.com) search querylog
    Amazon to warehouse mapping task, consecutively halving the clicklog size used
    1.8 million clicks to Amazon.com product pages
    Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
  • 28. Summary of Results
    90% precision / recall possible
    Query distribution is a good similarity metric
    Bigger clicklogs imply better recall
    Technique isn't very sensitive to similarity metric
  • 29. Precision / Recall
    Commercial warehouse mapping, 258 products
    from a 70K term Amazon.com taxonomy
    to a 6,000 term warehouse taxonomy (613 categories used)
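For readers who want to reproduce this kind of evaluation, precision and recall of predicted mappings against a gold standard could be computed along the following lines. The product ids, category names, and the exact definitions used here (precision over predicted mappings, recall over gold mappings) are illustrative assumptions, not the evaluation code or data behind the reported results.

```python
def precision_recall(predicted, gold):
    """Compare predicted item -> category mappings with a gold standard.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold mappings
    """
    correct = sum(1 for item, category in predicted.items() if gold.get(item) == category)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output.
GOLD = {
    "p1": "Small Laptops",
    "p2": "Small Laptops",
    "p3": "Calculators",
    "p4": "Developer Tools",
}
PREDICTED = {
    "p1": "Small Laptops",
    "p2": "Pro. Laptops",  # wrong category
    "p3": "Calculators",
    # no prediction for p4, which lowers recall but not precision
}

precision, recall = precision_recall(PREDICTED, GOLD)
print(f"precision={precision:.2f} recall={recall:.2f}")
```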
  • 30. Summary of Results
    90% precision / recall possible
    Query distribution is a good similarity metric
    Bigger clicklogs imply better recall
    Technique isn't very sensitive to similarity metric
  • 31. Match Quality
    • Query distributions (QDs) of entities are closest to the distributions of their aggregate classes
    • QDs of similar aggregates are similar
    • QDs are unique to entities
    • QDs are unique to aggregate classes
  • 33. Summary of Results
    90% precision / recall possible
    Query distribution is a good similarity metric
    Bigger clicklogs imply better recall
    Technique isn't very sensitive to similarity metric
  • 34. Varying Clicklog Size
    Successively decreased clicklog size by half
    Recall decreases as clicklog size is decreased
  • 35. Summary of Results
    90% precision / recall possible
    Query distribution is a good similarity metric
    Bigger clicklogs imply better recall
    Technique isn't very sensitive to similarity metric
  • 36. Comparing Query Distributions
    $\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$
    (sum over all $q_{host}$, $q_{foreign}$ combinations)
    • Replace Jaccard with various phrase similarity metrics
    • Minimal difference due to size of most queries
  • 38. Summary of Results
    90% precision / recall possible
    Query distribution is a good similarity metric
    Bigger clicklogs imply better recall
    Technique isn't very sensitive to similarity metric
  • Related + Future Work
    Usage Based / Crowdsourcing
    Usage-Based Schema Matching (ICDE 2008), Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
    Matching schemas in online communities: A web 2.0 approach (ICDE 2008), R McCann, W Shen, AH Doan
    Web Scale Integration
    Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007), Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy
  • 41. Related + Future Work
    “Mixed” methods
    Ontology matching: A machine learning approach (Handbook on Ontologies 2004), A Doan, J Madhavan, P Domingos, A Halevy
    Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003), A Doan, P Domingos, A Halevy
    Schema and ontology matching with COMA++ (SIGMOD 2005), D Aumueller, HH Do, S Massmann, E Rahm
  • 42. Conclusion
    Unsupervised mapping is possible
    Very high recall / precision when enough queries are present
    Click logs are promising
    Finds results that other methods cannot find
    As clicklog size increases, it will produce more mappings
    Combinable with existing methods
  • 43. Questions?
    http://arnab.org/contact
    http://research.microsoft.com/~philbe/
  • 44. Existing Methods
    A Survey of Approaches to Automatic Schema Matching (VLDBJ 2001), Erhard Rahm, Philip A. Bernstein
  • 45. Name-based & Instance-based
    Not ideal for our use case
    Need high precision
    “Task B”: Commercial warehouse mapping, 258 products in a 70K term taxonomy to a 6,000 term taxonomy
