Error Tolerant Record Matching PVERConf_May2011

Published: May 2011 Personal Validation and Entity Resolution Conference. Presenter: Surajit Chaudhuri, Microsoft Research

  • Thanks for the generous introduction. It is a great pleasure to speak to you. My talk today centers on the work that has been done as part of our Data Cleaning project. The people on this list really did all the hard work, so I would like to acknowledge their contributions.
  • Instead of trying to define what data cleaning and approximate matching are, let me motivate the challenges in this domain through a few examples. Here is one all of us are familiar with: we type an address, perhaps with some errors, as in the example above, look up a set of addresses, and get directions. What you would like the system to do is an approximate match.
  • Here is another familiar example: a screenshot of Microsoft's Product Search, which aggregates offers for products from multiple retailers, like CNET.com or Amazon.com. Ideally you want to make sure that we recognize multiple offers for the same product so that the consumer can compare prices; the two highlighted boxes should really come together because these two records represent the same entity. We are not yet good at doing this (because, unlike Windows Local Live, they don't use technology from my group).
  • The key challenge therefore is to be able to do this at scale. This is because you may have a large number of addresses.
  • Do not mention that all of the 17 functions are useful in different applications. Rather, stress that we need support for more than one: just picking one similarity function will not work.
  • If the edit distance between two strings is small, then the Jaccard similarity between their q-gram sets (in this instance, 1-gram sets) is large.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Suppose JaccSim(r, s) >= 2/3. If we pick more than 1/3 of the elements of r, at least one must also be in s.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Here is an example. On the left side, we list some example entities from the MSN Shopping database, and on the right side there are some web documents talking about those products. As we can see, people have many different ways to refer to a product, and the descriptions they use often differ from the database version. Exact matching will fail to catch those mentions. To address this problem, we use approximate matching, which computes a similarity score between sub-strings and entities; when the similarity score exceeds a threshold T, we consider it a match.
  • Here are some key uses of our software. Bing Maps uses Fuzzy Lookup at the front end for matching user queries against landmarks, and also at the back end to de-dupe yellow-page feeds. Bing Shopping uses our software for back-end de-duplication of product names and descriptions. There are other key uses of our software, as listed on the slide, which I will skip over.

    1. Error Tolerant Record Matching. Surajit Chaudhuri, Microsoft Research.
    2. Key Contributors: Sanjay Agrawal, Arvind Arasu, Zhimin Chen, Kris Ganjam, Venky Ganti, Raghav Kaushik, Christian Konig, Rajeev Motwani (Stanford), Vivek Narasayya, Dong Xin. (5/23/2011, surajitc@microsoft.com)
    3. Data Warehousing & Business Intelligence: External Source -> Extract-Transform-Load -> Data Warehouse -> Analysis Services, Query/Reporting, Data Mining.
    4. Bing Maps.
    5. Bing Shopping.
    6. Objective: Reduce the cost of building a data cleaning application.
    7. Our Approach to Data Cleaning: core operators (Record Matching, Parsing, De-duplication) plus design tools, applied to Address Matching (Local Live) and Product De-duplication (Windows Live Products). Focus of this talk: Record Matching.
    8. Challenge: Record Matching over Large Data Sets. Match "Prairie Crosing Dr W Chicago IL 60185" against a reference table of addresses, a large table of ~10M rows.
    9-10. Efficient Indexing is Needed: for efficiency and scalability, and specific to the similarity function. Given query record r ("Prairie Crosing Dr W Chicago IL 60185"), find all rows sj in the reference table (~10M rows) such that Sim(r, sj) ≥ θ.
    11. Outline: Introduction and Motivation; Two Challenges in Record Matching; Concluding Remarks.
    12. Challenge 1: Too Many Similarity Functions. Methodology: choose a similarity function f appropriate for the domain, then choose the best implementation of f with support for indexing. Can we get away with a common foundation and simulate these variations?
    13. Challenge 2: Lack of Customizability. Abbreviations (USA ≈ United States of America, St ≈ Street, NE ≈ North East); name variations (Mike ≈ Michael, Bill ≈ William); aliases (One ≈ 1, First ≈ 1st). Can we inject customizability without loss of efficiency?
    14. Challenge 1: Too Many Similarity Functions.
    15. Jaccard Similarity: a statistical measure, originally defined over sets; treat a string as a set of words; range of values [0,1]. Jaccard(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|.
    16. Seeking a Common Foundation: Jaccard Similarity. "148th Ave NE, Redmond, WA" vs. "140th Ave NE, Redmond, WA": Jaccard = 4/(4+2) ≈ 0.66.
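The word-level Jaccard computation on this slide can be sketched in a few lines (illustrative code, not from the talk; punctuation is left attached to tokens, which a real tokenizer would strip):

```python
def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity between two strings viewed as sets of words."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)

r = "148th Ave NE, Redmond, WA"
s = "140th Ave NE, Redmond, WA"
print(jaccard(r, s))  # 4 shared tokens out of 6 distinct: 0.666...
```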
    17. Using Jaccard Similarity to Implement f. Map both the query and the reference table through String -> Set; look up candidates with Jaccard similarity ≥ θ'; then check f ≥ θ on the candidates. θ' is chosen so that f ≥ θ implies Jaccard similarity ≥ θ'.
    18. Edit Similarity -> Set Similarity. "Crossing" -> {C,r,o,s,s,i,n,g}, "Crosing" -> {C,r,o,s,i,n,g}: Jaccard similarity 7/8. If strlen(r) ≥ strlen(s): Edit Distance(r, s) ≤ k implies Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) - k)/(strlen(r) + k).
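To illustrate the reduction, here is a sketch (my own code, not the talk's) checking the bound on the slide's example. Note that the Jaccard here is over 1-gram multisets, which is why "Crossing" vs. "Crosing" gives 7/8 rather than 1:

```python
from collections import Counter

def multiset_jaccard(r: str, s: str) -> float:
    """Jaccard similarity over 1-gram multisets (bags of characters)."""
    a, b = Counter(r), Counter(s)
    return sum((a & b).values()) / sum((a | b).values())

def edit_distance(r: str, s: str) -> int:
    """Classic single-row dynamic-programming Levenshtein distance."""
    m, n = len(r), len(s)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, substitution (cost 0 if chars agree)
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (r[i - 1] != s[j - 1]))
            prev = cur
    return dp[n]

r, s = "Crossing", "Crosing"        # strlen(r) >= strlen(s)
k = edit_distance(r, s)             # one deletion, so k = 1
bound = (len(r) - k) / (len(r) + k)
print(multiset_jaccard(r, s), ">=", bound)  # 0.875 >= 7/9 ≈ 0.778
```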
    19. Inverted Index Based Approach. The query "100 Prairie Crossing Dr Chicago" probes the rid-lists (lists of record ids) for tokens 100, Prairie, Crossing, Dr, Drive, Chicago over 0.5M rows: ≥ 0.5M comparisons.
    20. Prefix Filter. r = "100 Prairie Crossing Dr Chicago" and s = "100 Prairie Crossing Drive Chicago" share 4 tokens, so any size-2 subset of r has non-empty overlap with s.
    21. Inverted Index Based Approach with Prefix Filter. For the query "100 Prairie Crossing Dr Chicago", probe only the rid-lists for 100 and Prairie.
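A self-contained sketch of prefix filtering over an inverted index follows. The table contents and rids are made up for illustration; the real system orders tokens rarest-first, and alphabetical order stands in for that here. The filter relies on the fact that Jaccard(r, s) ≥ θ implies |r ∩ s| ≥ θ·|r|:

```python
from collections import defaultdict
from math import ceil

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def prefix_len(n: int, theta: float) -> int:
    # Jaccard >= theta implies overlap >= ceil(theta * n), so any
    # (n - ceil(theta * n) + 1)-subset of the tokens must hit a match.
    return n - ceil(theta * n) + 1

def canon(s: str) -> list:
    # Canonical token order; alphabetical stands in for rarest-first.
    return sorted(set(s.split()))

# Hypothetical rid -> record table (contents invented for this sketch).
table = {
    2:  "100 Prairie Crossing Dr Chicago",
    10: "100 Prairie Crossing Drive Chicago",
    37: "500 Michigan Ave Chicago",
}
theta = 0.5

# Index only the prefix tokens of each record.
index = defaultdict(set)
for rid, rec in table.items():
    toks = canon(rec)
    for t in toks[:prefix_len(len(toks), theta)]:
        index[t].add(rid)

def lookup(query: str) -> list:
    toks = canon(query)
    cands = set()
    for t in toks[:prefix_len(len(toks), theta)]:
        cands |= index[t]
    # Verify surviving candidates against the actual threshold.
    q = set(query.split())
    return sorted(r for r in cands if jaccard(q, set(table[r].split())) >= theta)

print(lookup("100 Prairie Crossing Dr Chicago"))  # [2, 10]
```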
    22. Signature based Indexing. Use a signature-based scheme to further reduce the cost of indexing and index lookup. Property: if two strings have high Jaccard coefficient, their signatures must intersect. LSH signatures work well.
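A minimal minhash sketch of the LSH idea (my illustration, not the talk's implementation): each coordinate of the signature agrees between two sets with probability equal to their Jaccard similarity, so similar strings get intersecting signatures. A full LSH index would additionally group coordinates into bands and hash each band to a bucket.

```python
import zlib

def minhash_signature(tokens, num_hashes=32):
    """One minhash per salted hash; crc32 keeps runs deterministic."""
    return [min(zlib.crc32((str(i) + t).encode()) for t in tokens)
            for i in range(num_hashes)]

a = set("100 Prairie Crossing Dr Chicago".split())
b = set("100 Prairie Crossing Drive Chicago".split())
c = set("500 Michigan Ave".split())

sig_a, sig_b, sig_c = (minhash_signature(s) for s in (a, b, c))

def agree(x, y):
    # Fraction of agreeing coordinates estimates Jaccard similarity.
    return sum(i == j for i, j in zip(x, y)) / len(x)

# a and b have Jaccard 4/6, a and c have Jaccard 0.
print(agree(sig_a, sig_b), agree(sig_a, sig_c))
```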
    23. Challenge 2: Lack of Customizability.
    24. Normalization? With the rule Alan -> A, "Alan Turing" normalizes to "A Turing": Jaccard similarity 1.0.
    25. Normalization? With the rules Alan -> A and Aaron -> A, both "Alan Turing" and "Aaron Turing" normalize to "A Turing", again Jaccard similarity 1.0: normalization conflates distinct names.
    26. Transformations. Transformation rules: Xing -> Crossing, W -> West, Dr -> Drive. Programmable Similarity = Set Similarity plus transformation rules.
    27-34. Semantics of Programmable Similarity: Example. With the rules Xing -> Crossing and Dr -> Drive, "Prairie Crossing Dr Chicago" generates {Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago}, and "Prairie Xing Dr Chicago" generates {Prairie Xing Dr Chicago, Prairie Xing Drive Chicago, Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago}. Programmable similarity is the maximum set similarity over pairs of generated strings; both sides generate "Prairie Crossing Drive Chicago", so here it is 1.0.
    35. Source of Transformations. Domain-specific authorities (~200,000 rules from USPS for address matching; hard to capture using a black-box similarity function); the Web (Wikipedia redirects); programs (First -> 1st, Second -> 2nd).
    36. Computational Challenge: Blowup. For "ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA", the rules give 2 choices for ATT (ATT, American Telephone and Telegraph), 2 for Corp, 4 for 100 (100, One Hundred, Hundred, Door 100), 2 for Xing, 2 for Dr, 2 for IL, and 3 for USA (USA, United States, United States of America): 384 variations!
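The 384 on this slide is just the product of the per-token alternative counts; a quick sketch enumerating the variants (alternative lists copied from the slide):

```python
from itertools import product

# Per-token alternatives, each list including the original token.
alternatives = [
    ["ATT", "American Telephone and Telegraph"],
    ["Corp", "Corporation"],
    ["100", "One Hundred", "Hundred", "Door 100"],
    ["Xing", "Crossing"],
    ["Dr", "Drive"],
    ["IL", "Illinois"],
    ["USA", "United States", "United States of America"],
]

variants = [" ".join(choice) for choice in product(*alternatives)]
print(len(variants))  # 2*2*4*2*2*2*3 = 384
```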
    37. Similarity With Transformations: Bipartite Matching. Tokens of "Prairie Xing Dr Chicago" and "Prairie Crossing Drive Chicago" are matched using the rules (Xing -> Crossing, W -> West, Dr -> Drive). Max Intersection = Max Matching = 4; Max Jaccard = Max Intersection / (8 - Max Intersection) = 4/4 = 1.
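One way to sketch the bipartite-matching semantics (my illustration): build an edge between tokens that are equal or related by a rule, then compute a maximum matching with Kuhn's augmenting-path algorithm:

```python
def max_matching(left, right, compatible):
    """Kuhn's augmenting-path algorithm for maximum bipartite matching."""
    match = {}  # index into right -> index into left

    def try_augment(u, seen):
        for v in range(len(right)):
            if compatible(left[u], right[v]) and v not in seen:
                seen.add(v)
                if v not in match or try_augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return sum(try_augment(u, set()) for u in range(len(left)))

rules = {("Xing", "Crossing"), ("W", "West"), ("Dr", "Drive")}

def compatible(u, v):
    return u == v or (u, v) in rules or (v, u) in rules

r = "Prairie Xing Dr Chicago".split()
s = "Prairie Crossing Drive Chicago".split()

m = max_matching(r, s, compatible)
jacc = m / (len(r) + len(s) - m)  # max Jaccard under transformations
print(m, jacc)  # 4 1.0
```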
    38. Extensions to Signature based Indexing. Use the same LSH signature-based scheme to reduce the cost of indexing and index lookup. Two properties: if two strings have high Jaccard coefficient, their signatures must intersect; and all LSH signatures corresponding to generated strings can be obtained efficiently without materializing them.
    39. Challenge of Setting Thresholds. Operator tree: Parse Address on R(Address); then Similarity Join on (St, City) at threshold 0.9 and Similarity Join on (St, State, Zip) at threshold 0.7 against S(St, City, State, Zip), each with transformation rules (WA -> Washington, WI -> Wisconsin, FL -> Florida, Xing -> Crossing, W -> West, Dr -> Drive); then Union. What are the "right" thresholds?
    40-48. Learning From Examples. Input: a set of examples (matches and non-matches) and an operator tree invoking (possibly multiple) Similarity Join operations. Goal: set the thresholds (number of thresholds = number of join columns) so that the number of false positives stays below a budget B (precision constraint) while recall, the number of correctly classified matching pairs, is maximized. Can be generalized to also choose the joining columns and similarity functions.
    49. Outline: Introduction and Motivation; Two Challenges in Record Matching; Concluding Remarks.
    50. Real-World Record Matching Task. Katrina: given evacuee lists, match against enquiries.
    51. Beyond Enterprise Data. Documents: "The Canon EOS Rebel XTi remains a very good first dSLR…"; "The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…"; "New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…". Challenge: pairwise matching.
    52. Final Thoughts. Goal: make application building easier (customizability, efficiency). Internal impact of MSR's record matching: SQL Server Integration Services; relationship discovery in Excel PowerPivot; Bing Maps; Bing Shopping. Open issues: a design studio for record matching; record matching for web-scale problems; broader use of feature engineering techniques.
    53. Questions?
    54. References:
       Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning String Transformations from Examples. VLDB 2009.
       Surajit Chaudhuri and Raghav Kaushik. Extending Autocompletion to Tolerate Errors. ACM SIGMOD 2009.
       Arvind Arasu, Christopher Re, and Dan Suciu. Large-Scale Deduplication with Constraints using Dedupalog. IEEE ICDE 2009.
       Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Transformation-based Framework for Record Matching. IEEE ICDE 2008.
       Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example Driven Design of Efficient Record Matching Queries. VLDB 2007.
       Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik. Leveraging Aggregate Constraints for Deduplication. SIGMOD 2007.
       Surajit Chaudhuri and Venkatesh Ganti. Robust Identification of Fuzzy Duplicates. IEEE ICDE 2005: 865-876.
       Surajit Chaudhuri, Kris Ganjam, and Venkatesh Ganti. Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD 2003.
    55. Appendix: "Robust Identification of Fuzzy Duplicates" (IEEE ICDE 2005).
    56. Deduplication. Given a relation R, the goal is to partition R into groups such that each group consists of "duplicates" (records of the same entity). Also called reference reconciliation, entity resolution, merge/purge. Record matching and record linkage, which identify record pairs (across relations) that are duplicates, are important sub-goals of deduplication.
    57. Previous Techniques. Distance functions abstract closeness between tuples (e.g., edit distance, cosine similarity). Approach 1: clustering (hard to determine the number of clusters). Approach 2: partition into "valid" groups using a global threshold g: all pairs of tuples whose distance < g are considered duplicates, and the partitioning takes the connected components of the threshold graph.
    58. Our Approach. Local structural properties are important for identifying sets of duplicates. Identify two criteria that characterize local structural properties; formalize the duplicate elimination problem based upon these criteria (unique solution, rich space of solutions, impact of distance transformations, etc.); propose an algorithm for solving the problem.
    59. Compact Set (CS) Criterion. Duplicates are closer to each other than to other tuples: a group is compact if it consists of all mutual nearest neighbors. In {1, 2, 3, 6, 7, 10, 11, 12}: {1, 2, 3}, {6, 7}, and {10, 11, 12} are compact groups. Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets.
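A 1-D sketch of the compact-set check on the slide's example points (my simplified reading of the criterion: every member is strictly closer to all other members than to any point outside the group):

```python
def is_compact(group, points):
    """True if every member of `group` is closer to all other members
    than to any point outside the group (mutual nearest neighbors)."""
    outside = [p for p in points if p not in group]
    for v in group:
        d_in = max(abs(v - u) for u in group if u != v)
        d_out = min(abs(v - p) for p in outside)
        if d_in >= d_out:
            return False
    return True

points = [1, 2, 3, 6, 7, 10, 11, 12]
for g in ({1, 2, 3}, {6, 7}, {10, 11, 12}, {3, 6}):
    print(sorted(g), is_compact(g, points))  # only {3, 6} fails
```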
    60. Sparse Neighborhood (SN) Criterion. Duplicate tuples are well separated from other tuples: the neighborhood is "sparse". With growth spheres of radius nn(v) and 2·nn(v) around v, ng(v) = #tuples in the larger sphere / #tuples in the smaller sphere; ng(set S of tuples) = AGG{ng(v) of each v in S}; S is sparse if ng(S) < c.
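The growth ratio ng(v) can be sketched in 1-D as follows (again my illustration; the aggregate AGG and the constant c are parameters, and the values chosen here, max and 2.5, are assumptions for the demo):

```python
def ng(v, points):
    """Neighborhood growth at v: tuples within 2*nn(v) over tuples within
    nn(v), where nn(v) is the distance from v to its nearest neighbor."""
    others = [p for p in points if p != v]
    nn = min(abs(v - p) for p in others)
    inner = sum(abs(v - p) <= nn for p in others)
    outer = sum(abs(v - p) <= 2 * nn for p in others)
    return outer / inner

def is_sparse(group, points, c=2.5, agg=max):
    """S is an SN group if AGG of its members' growth ratios stays below c."""
    return agg(ng(v, points) for v in group) < c

points = [1, 2, 3, 6, 7, 10, 11, 12]
print([ng(v, points) for v in points])
print(is_sparse([6, 7], points), is_sparse([1, 2, 3], points))
```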
    61. Other Constraints. Goal: partition R into the minimum number of groups {G1,…,Gm} such that for all 1 ≤ i ≤ m, Gi is a compact set and Gi is an SN group. This alone can lead to unintuitive solutions: {101, 102, 104, 201, 202, 301, 302} as one group! Add a size constraint (the size of a group of duplicates is less than K) and a diameter constraint (the diameter of a group of duplicates is less than θ).
