Error Tolerant Record Matching PVERConf_May2011


Published: May 2011 Personal Validation and Entity Resolution Conference. Presenter: Surajit Chaudhuri, Microsoft Research

  • Thanks for the generous introduction. It is a great pleasure to speak to you. My talk today centers on the work done as part of our Data Cleaning project. The people on this list did all the hard work, so I would like to acknowledge their contributions.
  • Instead of trying to define data cleaning and approximate matching, let me motivate the challenges in this domain through a few examples. Here is one all of us are familiar with: we type an address, perhaps with some errors as in the example above, look it up against a set of addresses, and get directions. What you would like the system to do is an approximate match.
  • Here is another familiar example: a screenshot of Microsoft’s Product Search, which aggregates offers for products from multiple retailers, like CNET.com or Amazon.com. Ideally, you want to make sure that we recognize multiple offers for the same product so that the consumer can compare prices; the two highlighted boxes should really come together because these two records represent the same entity. We are not yet good at doing this (because, unlike Windows Local Live, they don’t use technology from my group) but id
  • The key challenge therefore is to be able to do this at scale. This is because you may have a large number of addresses.
  • Do not mention that all of the 17 functions are useful in different applications. Rather, stress that we need support for more than one. Just picking one similarity function will not work.
  • If the edit distance between two strings is small, then the Jaccard similarity between their q-gram sets (in this instance, 1-gram sets) is large.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Suppose JaccSim(r, s) ≥ 2/3. If we pick more than 1/3 of the elements of r, at least one must also be in s.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Here is an example. On the left side, we list some example entities from the MSN Shopping database, and on the right side, there are some web documents talking about those products. As we can see, people have many different ways to refer to a product, and the descriptions they use are often different from the database version. Exact matching will fail to catch those mentions. To address this problem, we use approximate matching, which computes a similarity score between sub-strings and entities; when the similarity score exceeds a threshold T, we consider it a match.
  • Here are some key uses of our software. Bing Maps uses Fuzzy Lookup at the front end for matching user queries against landmarks, and also at the back end to de-dupe yellow-page feeds. Bing Shopping uses our software for back-end de-duplication of product names and descriptions. There are other key uses of our software as listed on the slide which I will skip over.
  • Transcript

    • 1. Error Tolerant Record Matching
      Surajit Chaudhuri
      Microsoft Research
    • 2. Key Contributors
      Sanjay Agrawal
      Arvind Arasu
      Zhimin Chen
      Kris Ganjam
      Venky Ganti
      Raghav Kaushik
      Christian Konig
      Rajeev Motwani (Stanford)
      Vivek Narasayya
      Dong Xin
      5/23/2011
      surajitc@microsoft.com
      2
    • 3. [Architecture diagram] Data Warehousing & Business Intelligence: External Source, Extract - Transform - Load, Data Warehouse, Analysis Services, Data Mining, Query / Reporting
    • 4. Bing Maps
    • 5. Bing Shopping
    • 6. OBJECTIVE: Reduce Cost of building a data cleaning application
    • 7. Our Approach to Data Cleaning
      [Diagram] Applications: Address Matching (Local Live), Product De-duplication (Windows Live Products); built with Design Tools over Core Operators: Record Matching, Parsing, De-duplication
      Focus of this talk: Record Matching
    • 8. Challenge: Record Matching over Large Data Sets
      Query: Prairie Crosing Dr, W Chicago IL 60185
      Reference table of addresses: a Large Table (~10M Rows)
    • 9. Efficient Indexing is Needed
      • Needed for Efficiency & Scalability
      • Specific to similarity function
      Query: Prairie Crosing Dr, W Chicago IL 60185
      Find all rows sj in the reference table such that Sim(r, sj) ≥ θ
      Large Table (~10M Rows)
    • 11. Outline
      Introduction and Motivation
      Two Challenges in Record Matching
      Concluding Remarks
    • 12. Challenge 1: Too Many Similarity Functions
      Methodology
      Choose similarity function f appropriate for the domain
      Choose best implementation of f with support for indexing
      Can we get away with a common foundation and simulate these variations?
    • 13. Challenge 2: Lack of Customizability
      Abbreviations: USA ≈ United States of America; St ≈ Street, NE ≈ North East
      Name variations: Mike ≈ Michael, Bill ≈ William
      Aliases: One ≈ 1, First ≈ 1st
      Can we inject customizability without loss of efficiency?
    • 14. Challenge 1: Too Many Similarity Functions
    • 15. Jaccard Similarity
      Statistical measure
      Originally defined over sets
      String = set of words
      Range of values: [0, 1]
      Jaccard(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|
    • 16. Seeking a Common Foundation: Jaccard Similarity
      148th Ave NE, Redmond, WA
      140th Ave NE, Redmond, WA
      Jaccard = 4 / (4 + 2) ≈ 0.66
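As a quick illustration of the word-set Jaccard similarity above, here is a minimal Python sketch (not from the deck; the tokenizer simply drops commas and splits on whitespace):

```python
def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity of two strings viewed as sets of words."""
    a = set(s1.replace(",", " ").split())
    b = set(s2.replace(",", " ").split())
    return len(a & b) / len(a | b)

# The slide's example: 4 shared tokens, 6 tokens in the union.
score = jaccard("148th Ave NE, Redmond, WA", "140th Ave NE, Redmond, WA")
# score = 4/6, approximately 0.66
```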
    • 17. Using Jaccard Similarity to Implement f
      [Diagram] The query and the reference table rows are each converted (String → Set); an index lookup retrieves candidates with Jacc. Sim. ≥ θ′, then the exact check f ≥ θ is applied
      f ≥ θ ⇒ Jacc. Sim. ≥ θ′
    • 18. Edit Similarity → Set Similarity
      Crossing → {C, r, o, s, s, i, n, g}
      Crosing → {C, r, o, s, i, n, g}
      Jaccard Similarity = 7/8
      If strlen(r) ≥ strlen(s):
      Edit Distance(r, s) ≤ k ⇒ Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) − k) / (strlen(r) + k)
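The implication on this slide can be checked directly. Below is an illustrative Python sketch (the edit-distance routine is a plain Levenshtein dynamic program, not the deck's implementation; the 1-grams are character multisets, as the 7/8 on the slide requires):

```python
from collections import Counter

def edit_distance(r, s):
    # plain Levenshtein dynamic program
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (cr != cs)))   # substitute
        prev = cur
    return prev[-1]

def one_gram_jaccard(r, s):
    # Jaccard over 1-gram (character) multisets
    a, b = Counter(r), Counter(s)
    return sum((a & b).values()) / sum((a | b).values())

r, s = "Crossing", "Crosing"
k = edit_distance(r, s)               # 1
bound = (len(r) - k) / (len(r) + k)   # 7/9
sim = one_gram_jaccard(r, s)          # 7/8, which is >= 7/9
```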
    • 19. Inverted Index Based Approach
      Query: 100 Prairie Crossing Dr Chicago
      [Diagram] Each token (100, Prairie, Crossing, Dr, Drive, Chicago) points to its rid-list over a 0.5M-row table; merging every list costs ≥ 0.5M comparisons
    • 20. Prefix Filter
      r: 100 Prairie Crossing Dr Chicago
      s: 100 Prairie Crossing Drive Chicago
      |r ∩ s| = 4; each string has one unshared token
      Any size-2 subset of r has non-empty overlap with s
    • 21. Inverted Index Based Approach
      Query: 100 Prairie Crossing Dr Chicago; probe only the rid-lists of 100 and Prairie
      [Diagram: the same inverted index over the 0.5M-row table]
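A sketch of the rid-list index combined with the prefix filter from the previous slide (illustrative Python, not the deck's code; the example table and θ are my own). Sorting the query's tokens by rid-list length makes the probed subset the cheapest one, which is what "use 100 and Prairie" achieves on the slide's example:

```python
from collections import defaultdict
from math import ceil

def build_index(table):
    # Inverted index: token -> rid-list (ids of records containing the token)
    index = defaultdict(list)
    for rid, record in enumerate(table):
        for tok in set(record.split()):
            index[tok].append(rid)
    return index

def candidates(query, index, theta):
    # Prefix filter: if Jaccard(r, s) >= theta, then s shares a token with
    # ANY subset of r's tokens of size |r| - ceil(theta * |r|) + 1, so
    # probing that many rid-lists cannot miss a true match. Probing the
    # rarest tokens keeps the scanned lists short.
    toks = sorted(set(query.split()), key=lambda t: len(index.get(t, [])))
    prefix_len = len(toks) - ceil(theta * len(toks)) + 1
    cands = set()
    for tok in toks[:prefix_len]:
        cands.update(index.get(tok, []))
    return cands

table = [
    "100 Prairie Crossing Drive Chicago",
    "200 Main St Springfield",
    "100 N Prairie Ave Joliet",
]
index = build_index(table)
cands = candidates("100 Prairie Crossing Dr Chicago", index, theta=2/3)
```

For a five-token query and θ = 2/3, prefix_len is 5 − ⌈10/3⌉ + 1 = 2, matching the two-token probe on the slide.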
    • 22. Signature based Indexing
      Use a signature-based scheme to further reduce the cost of indexing and index lookup
      Property: if two strings have high Jaccard similarity, then their signatures must intersect
      LSH signatures work well
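The deck does not spell out its LSH scheme; one standard signature family for Jaccard similarity is min-hashing, sketched here (the hash family, seeds, and modulus are my own simplifications):

```python
import random

def minhash_signature(tokens, seeds, m=2**61 - 1):
    # One min-hash per seed; two sets agree on a signature coordinate with
    # probability approximately equal to their Jaccard similarity, so
    # high-similarity pairs almost surely share signature values.
    sig = []
    for seed in seeds:
        rnd = random.Random(seed)
        a, b = rnd.randrange(1, m), rnd.randrange(m)
        sig.append(min((a * hash(t) + b) % m for t in tokens))
    return sig

seeds = range(50)
s1 = minhash_signature({"100", "Prairie", "Crossing", "Dr", "Chicago"}, seeds)
s2 = minhash_signature({"100", "Prairie", "Crossing", "Drive", "Chicago"}, seeds)
agree = sum(x == y for x, y in zip(s1, s2)) / len(s1)  # close to Jaccard = 4/6
```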
    • 23. Challenge 2: Lack of Customizability
    • 24. Normalization?
      Rule: Alan → A
      Alan Turing → A Turing; A Turing → A Turing
      Jaccard Similarity (A Turing, A Turing) = 1.0
    • 25. Normalization?
      Rules: Alan → A, Aaron → A
      Aaron Turing → A Turing; Alan Turing → A Turing
      Jaccard Similarity (A Turing, A Turing) = 1.0, so normalization conflates Aaron Turing with Alan Turing
    • 26. Transformations
      Programmable Similarity = Set Similarity + Transformation Rules
      Xing → Crossing
      W → West
      Dr → Drive
      BYU Talk
    • 27. Semantics of Programmable Similarity
      Input strings: Prairie Crossing Dr Chicago and Prairie Xing Dr Chicago
    • 28-34. Semantics: Example
      The rules generate all variants of each string:
      Prairie Crossing Dr Chicago → { Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago }
      Prairie Xing Dr Chicago → { Prairie Xing Dr Chicago, Prairie Xing Drive Chicago, Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago }
      Similarity = maximum set similarity over the generated pairs = 1.0
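The generate-and-maximize semantics above can be sketched naively in Python (illustrative only; the rule table is the slide's example, and real rules can rewrite multi-token phrases rather than single tokens):

```python
from itertools import product

# rule set transcribed from the slide (single-token rewrites only)
RULES = {"Xing": ["Crossing"], "Dr": ["Drive"], "W": ["West"]}

def variants(s):
    # each token either stays as-is or is rewritten by an applicable rule;
    # the generated strings are the cross-product of the per-token choices
    choices = [[tok] + RULES.get(tok, []) for tok in s.split()]
    return {" ".join(choice) for choice in product(*choices)}

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def sim_with_transformations(r, s):
    # programmable similarity: max set similarity over all generated pairs
    return max(jaccard(x, y) for x in variants(r) for y in variants(s))

score = sim_with_transformations("Prairie Xing Dr Chicago",
                                 "Prairie Crossing Dr Chicago")  # 1.0
```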
    • 35. Source of Transformations
      Domain-specific authorities: ~200,000 rules from USPS for address matching; hard to capture using a black-box similarity function
      Web: Wikipedia redirects
      Program: First → 1st, Second → 2nd
    • 36. Computational Challenge: Blowup
      ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA
      ATT → { ATT, American Telephone and Telegraph }
      Corp → { Corp, Corporation }
      100 → { 100, One Hundred, Hundred, Door 100 }
      Xing → { Xing, Crossing }
      Dr → { Dr, Drive }
      IL → { IL, Illinois }
      USA → { USA, United States, United States of America }
      384 variations!
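The 384 comes from multiplying the number of alternatives for each rewritable token (tokens with no rule contribute a factor of 1); a one-liner confirms it:

```python
from math import prod

# alternatives per token, including the original token itself
options = {"ATT": 2, "Corp": 2, "100": 4, "Xing": 2, "Dr": 2, "IL": 2, "USA": 3}
n_variations = prod(options.values())  # 2 * 2 * 4 * 2 * 2 * 2 * 3 = 384
```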
    • 37. Similarity With Transformations: Bipartite Matching
      r: Prairie Xing Dr Chicago
      s: Prairie Crossing Drive Chicago
      Rules: Xing → Crossing, W → West, Dr → Drive
      Max Intersection = Max Matching = 4
      Max Jaccard = Max Intersection / (8 − Max Intersection) = 4/4 = 1
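The matching computation on this slide can be sketched with a textbook augmenting-path algorithm (illustrative Python; tokens are treated as compatible when they are equal or connected by a rule, which is my reading of the slide's bipartite graph):

```python
def max_matching(left, right, compatible):
    # maximum bipartite matching via augmenting paths (Kuhn's algorithm)
    match_r = {}  # index in `right` -> index in `left`
    def augment(u, seen):
        for v in range(len(right)):
            if v not in seen and compatible(left[u], right[v]):
                seen.add(v)
                if v not in match_r or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    return sum(augment(u, set()) for u in range(len(left)))

RULES = {("Xing", "Crossing"), ("Dr", "Drive"), ("W", "West")}
compatible = lambda a, b: a == b or (a, b) in RULES

r = "Prairie Xing Dr Chicago".split()
s = "Prairie Crossing Drive Chicago".split()
m = max_matching(r, s, compatible)        # 4
max_jaccard = m / (len(r) + len(s) - m)   # 4 / (8 - 4) = 1.0
```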
    • 38. Extensions to Signature based Indexing
      Use the same LSH signature-based scheme to reduce the cost of indexing and index lookup
      Two Properties:
      If two strings have high Jaccard similarity, then their signatures must intersect
      All LSH signatures corresponding to the generated strings can be obtained efficiently without materializing them
    • 39. Challenge of Setting Thresholds
      [Operator tree] R (Address) → Parse Address → R (St, City, State, Zip); Similarity Join (St, City) at threshold 0.9 and Similarity Join (St, State, Zip) at threshold 0.7 against S (St, City, State, Zip); results combined by Union
      Each join carries its own transformation rules: WA → Washington, WI → Wisconsin, FL → Florida, Xing → Crossing, W → West, Dr → Drive
      What are the “right” thresholds?
    • 40. Learning From Examples
      Input:
      • A set of examples: matches & non-matches
      • An operator tree invoking (multiple) Sim Join operations
      Goal: set the thresholds (number of thresholds = no. of join columns) such that
      • Precision: the number of false positives is less than B
      • Recall is maximized: the number of correctly classified matching pairs
      Can be generalized to also choose joining columns and similarity functions
    • 49. Outline
      Introduction and Motivation
      Two Challenges in Record Matching
      Concluding Remarks
    • 50. Real-World Record Matching Task
      Katrina: Given evacuee lists, match against enquiries
    • 51. Beyond Enterprise Data
      The Canon EOS Rebel XTi remains a very good first dSLR…
      The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…
      New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…
      Documents
      Challenge: Pairwise Matching
    • 52. Final Thoughts
      Goal: Make Application building easier
      Customizability; Efficiency
      Internal Impact of MSR’s Record Matching
      SQL Server Integration Services; Relationship Discovery in Excel PowerPivot
      Bing Maps, Bing Shopping
      Open Issues
      Design Studio for Record Matching
      Record Matching for Web Scale Problems
      Broader use of Feature engineering techniques
    • 53. Questions?
    • 54. References
      Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB 2009
      Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD 2009
      Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in IEEE ICDE 2009
      Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in IEEE ICDE 2008
      Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB 2007
      Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik, Leveraging Aggregate Constraints for Deduplication, in SIGMOD 2007
      Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani, Robust Identification of Fuzzy Duplicates, in IEEE ICDE 2005: 865-876
      Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, in SIGMOD 2003
    • 55. Appendix: “Robust Identification of Fuzzy Duplicates” (IEEE Data Engineering, 2005)
    • 56. Deduplication
      Given a relation R, the goal is to partition R into groups such that each group consists of “duplicates” (of the same entity)
      Also called reference reconciliation, entity resolution, merge/purge
      Record matching, record linkage: identifying record pairs (across relations) which are duplicates; these are important sub-goals of deduplication
    • 57. Previous Techniques
      Distance functions to abstract closeness between tuples (e.g., edit distance, cosine similarity)
      Approach 1: clustering; hard to determine the number of clusters
      Approach 2: partition into “valid” groups with a global threshold g; all pairs of tuples whose distance < g are considered duplicates
      Partitioning: connected components in the threshold graph
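The global-threshold approach amounts to union-find over all close pairs. A sketch (example data and distance are mine) that also exposes the chaining weakness motivating the next slides:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def threshold_groups(points, dist, g):
    # connected components of the graph with an edge wherever dist < g
    uf = UnionFind(len(points))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) < g:
                uf.union(i, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(uf.find(i), []).append(points[i])
    return sorted(groups.values())

# Chaining: with g = 2 the points 1..5 collapse into one group even though
# 1 and 5 are far apart, which is why a global threshold can misbehave.
groups = threshold_groups([1, 2, 3, 4, 5, 10], lambda a, b: abs(a - b), 2)
```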
    • 58. Our Approach
      Local structural properties are important for identifying sets of duplicates
      Identify two criteria to characterize local structural properties
      Formalize the duplicate elimination problem based upon these criteria: unique solution, rich space of solutions, impact of distance transformations, etc.
      Propose an algorithm for solving the problem
    • 59. Compact Set (CS) Criterion
      Duplicates are closer to each other than to other tuples
      A group is compact if it consists of all mutual nearest neighbors
      In {1, 2, 3, 6, 7, 10, 11, 12}: {1, 2, 3}, {6, 7}, {10, 11, 12} are compact groups
      Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets
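The criterion can be sketched as a direct check (illustrative Python; I read "closer to each other than to other tuples" as every intra-group distance being smaller than every distance from a member to a tuple outside the group, which reproduces the slide's example groups):

```python
def is_compact(group, table, dist):
    # every intra-group distance must be smaller than every distance from
    # a group member to a tuple outside the group
    inside = [v for v in table if v in group]
    outside = [v for v in table if v not in group]
    max_inside = max((dist(a, b) for a in inside for b in inside if a != b),
                     default=0)
    min_outside = min((dist(a, b) for a in inside for b in outside),
                      default=float("inf"))
    return max_inside < min_outside

R = [1, 2, 3, 6, 7, 10, 11, 12]
d = lambda a, b: abs(a - b)
# {1, 2, 3}, {6, 7}, {10, 11, 12} pass; merging {1, 2, 3} with {6, 7} fails
```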
    • 60. Sparse Neighborhood (SN) Criterion
      Duplicate tuples are well-separated from other tuples; the neighborhood is “sparse”
      [Figure: growth spheres of radius nn(v) and 2∙nn(v) around v]
      ng(v) = #tuples in the larger sphere / #tuples in the smaller sphere around v
      ng(set S of tuples) = AGG{ ng(v) of each v in S }
      S is sparse if ng(S) < c
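A per-tuple sketch of the growth ratio (illustrative Python; whether v itself is counted is not specified on the slide, so it is excluded here, which only shifts both counts by one):

```python
def neighborhood_growth(v, table, dist):
    # nn(v): distance from v to its nearest neighbor; ng(v) compares the
    # populations of the spheres of radius 2*nn(v) and nn(v) around v
    others = [u for u in table if u != v]
    nn = min(dist(v, u) for u in others)
    small = sum(1 for u in others if dist(v, u) <= nn)
    large = sum(1 for u in others if dist(v, u) <= 2 * nn)
    return large / small

R = [1, 2, 3, 6, 7, 10, 11, 12]
d = lambda a, b: abs(a - b)
ng6 = neighborhood_growth(6, R, d)    # only 7 lies within radius 2 of 6
ng10 = neighborhood_growth(10, R, d)  # {11} in the small sphere, {11, 12} in the large
```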
    • 61. Other Constraints
      Goal: partition R into the minimum number of groups {G1, …, Gm} such that for all 1 ≤ i ≤ m, Gi is a compact set and Gi is an SN group
      Can lead to unintuitive solutions: {101, 102, 104, 201, 202, 301, 302} as 1 group!
      Size constraint: the size of a group of duplicates is less than K
      Diameter constraint: the diameter of a group of duplicates is less than θ
