Error Tolerant Record Matching PVERConf_May2011

May 2011 Personal Validation and Entity Resolution Conference. Presenter: Surajit Chaudhuri, Microsoft Research

Speaker Notes

  • Thanks for the generous introduction. It is a great pleasure to speak to you. My talk today centers on the work that has been done as part of our Data Cleaning project. The people on this list did all the hard work, so I would like to acknowledge their contributions.
  • Instead of trying to define data cleaning and approximate matching, let me motivate the challenges in this domain through a few examples. Here is one all of us are familiar with: we type an address, perhaps with some errors, as in the example above, look up a set of addresses, and get directions. What you would like the system to do is an approximate match.
  • Here is another familiar example: a screenshot of Microsoft's Product Search, which aggregates offers for products from multiple retailers, like CNET.com or Amazon.com. Ideally, you want to make sure that we recognize multiple offers for the same product so that the consumer can compare prices; the two highlighted boxes should really come together because these two records represent the same entity. We are not yet good at doing this (because, unlike Windows Local Live, they don't use technology from my group), but id…
  • The key challenge, therefore, is to be able to do this at scale, because you may have a large number of addresses.
  • Do not mention that all of the 17 functions are useful in different applications. Rather, stress that we need support for more than one. Just picking one similarity function will not work.
  • If the edit distance between two strings is small, then the Jaccard similarity between their q-gram sets (in this instance, 1-gram sets) is large.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Suppose JaccSim(r, s) ≥ 2/3. If we pick more than 1/3 of the elements of r, at least one must also be in s.
  • Mention that the lists are rid-lists. So 2, 10, etc. are record ids.
  • Here is an example. On the left side, we list some example entities from the MSN Shopping database, and on the right side, there are some web documents talking about those products. As we can see, people have many different ways to refer to a product, and the descriptions they use are often different from the database version. Exact matching will fail to catch those mentions. To address this problem, we use approximate matching to compute a similarity score between sub-strings and entities; when the similarity score exceeds a threshold T, we consider it a match.
  • Here are some key uses of our software. Bing Maps uses Fuzzy Lookup at the front end for matching user queries against landmarks, and also at the back end to de-dupe yellow page feeds. Bing Shopping uses our software for back-end de-duplication of product names and descriptions. There are other key uses of our software as listed on the slide, which I will skip over.

Error Tolerant Record Matching PVERConf_May2011: Presentation Transcript

  • Error Tolerant Record Matching
    Surajit Chaudhuri
    Microsoft Research
  • Key Contributors
    Sanjay Agrawal
    Arvind Arasu
    Zhimin Chen
    Kris Ganjam
    Venky Ganti
    Raghav Kaushik
    Christian Konig
    Rajeev Motwani (Stanford)
    Vivek Narasayya
    Dong Xin
  • [Diagram: data warehousing and business intelligence stack: External Source, Extract-Transform-Load, Data Warehouse, Query/Reporting, Data Mining, Analysis Services]
  • Bing Maps
  • Bing Shopping
  • OBJECTIVE: Reduce Cost of building a data cleaning application
  • Our Approach to Data Cleaning
    [Diagram: core operators (record matching, parsing, de-duplication) and design tools underpin applications such as address matching (Local Live) and product de-duplication (Windows Live Products); the focus of this talk is record matching]
  • Challenge: Record Matching over Large Data Sets
    Query: Prairie Crosing Dr, W Chicago IL 60185
    Reference table of addresses: large table (~10M rows)
  • Efficient Indexing is Needed
    Needed for efficiency & scalability
    Specific to the similarity function
    Query r: Prairie Crosing Dr, W Chicago IL 60185
    Find all rows sj in the reference table (a large table, ~10M rows) such that Sim(r, sj) ≥ θ
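
    To make the lookup problem concrete, here is a minimal baseline sketch (mine, not from the talk): a linear scan that compares the query against every reference row, with word-set Jaccard standing in for Sim. The indexing techniques on the next slides exist precisely to avoid this full scan; the names and data below are invented for illustration.

        # Baseline lookup: scan every reference row and keep those with Sim(r, s) >= theta.
        # 'jaccard' is a stand-in similarity function, not necessarily the talk's Sim.
        def jaccard(a, b):
            sa, sb = set(a.lower().split()), set(b.lower().split())
            return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

        def naive_lookup(query, reference, theta):
            # One similarity computation per row: O(|reference|) work per query,
            # which is too slow when the reference table has ~10M rows.
            return [s for s in reference if jaccard(query, s) >= theta]

        reference = ["100 Prairie Crossing Dr W Chicago IL 60185",
                     "200 Main St Seattle WA 98101"]
        # The misspelled query still matches the first row (Jaccard 6/9 ≈ 0.67 >= 0.5).
        print(naive_lookup("Prairie Crosing Dr W Chicago IL 60185", reference, 0.5))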
  • Outline
    Introduction and Motivation
    Two Challenges in Record Matching
    Concluding Remarks
  • Challenge 1: Too Many Similarity Functions
    Methodology
    Choose similarity function f appropriate for the domain
    Choose best implementation of f with support for indexing
    Can we get away with a common foundation and simulate these variations?
  • Challenge 2: Lack of Customizability
    Abbreviations
    USA ≈ United States of America
    St ≈ Street, NE ≈ North East
    Name variations
    Mike ≈ Michael, Bill ≈ William
    Aliases
    One ≈ 1, First ≈ 1st
    Can we inject customizability without loss of efficiency?
  • Challenge 1: Too Many Similarity Functions
  • Jaccard Similarity
    Statistical measure
    Originally defined over sets
    String = set of words
    Range of values: [0, 1]
    Jaccard(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|
  • Seeking a common Foundation: Jaccard Similarity
    148th Ave NE, Redmond, WA
    140th Ave NE, Redmond, WA
    Jaccard = 4 / (4 + 2) ≈ 0.66
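
    A small sketch (my own, not part of the deck) that reproduces this number by treating each address as a set of words:

        def jaccard(s1, s2):
            # Jaccard(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2| over word sets.
            a, b = set(s1.split()), set(s2.split())
            return len(a & b) / len(a | b)

        r = "148th Ave NE, Redmond, WA"
        s = "140th Ave NE, Redmond, WA"
        # 4 shared tokens, 6 distinct tokens in total: 4 / (4 + 2) ≈ 0.66
        print(jaccard(r, s))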
  • Using Jaccard Similarity to Implement f
    Convert the query and the reference table rows from strings to sets
    This implements a lookup on f via an index on Jaccard similarity: retrieve candidates with Jacc. Sim. ≥ θ', then check f ≥ θ on the candidates
    Requirement: f ≥ θ implies Jacc. Sim. ≥ θ'
  • Edit Similarity → Set Similarity
    Crossing → C, r, o, s, s, i, n, g
    Crosing → C, r, o, s, i, n, g
    Jaccard Similarity = 7/8
    If strlen(r) ≥ strlen(s):
    Edit Distance(r, s) ≤ k ⇒ Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) - k) / (strlen(r) + k)
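
    A hedged sketch of the bound above (my code; the 7/8 on the slide implies the 1-grams are treated as multisets, so that is what is used here):

        from collections import Counter

        def multiset_jaccard(a, b):
            # Jaccard over 1-gram multisets (characters with multiplicity).
            ca, cb = Counter(a), Counter(b)
            return sum((ca & cb).values()) / sum((ca | cb).values())

        def edit_distance(a, b):
            # Standard Levenshtein dynamic program with a rolling row.
            dp = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                prev, dp[0] = dp[0], i
                for j, y in enumerate(b, 1):
                    prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
            return dp[-1]

        r, s = "Crossing", "Crosing"
        k = edit_distance(r, s)                      # 1
        bound = (len(r) - k) / (len(r) + k)          # (8 - 1) / (8 + 1) ≈ 0.78
        print(multiset_jaccard(r, s), ">=", bound)   # 0.875 >= 0.78, as the bound promises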
  • Inverted Index Based Approach
    Query: 100 Prairie Crossing Dr Chicago
    Inverted index: one rid-list (a list of record ids, e.g. 2, 10, …) per token: 100, Prairie, Chicago, Drive, Crossing, Dr
    The table has 0.5 M rows; probing the rid-lists of all query tokens requires ≥ 0.5 M comparisons
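
    As a sketch of the rid-list index described above (data and names invented), each token maps to the list of record ids that contain it, and a naive probe touches the rid-list of every query token:

        from collections import Counter, defaultdict

        table = {2: "100 Prairie Crossing Drive Chicago",
                 10: "100 W Madison St Chicago",
                 11: "1 Main St Springfield"}

        # Inverted index: token -> rid-list (list of record ids).
        index = defaultdict(list)
        for rid, text in table.items():
            for token in set(text.split()):
                index[token].append(rid)

        # Naive probe: walk the rid-lists of *all* query tokens and count overlaps.
        query = "100 Prairie Crossing Dr Chicago".split()
        overlap = Counter(rid for tok in query for rid in index.get(tok, []))
        print(overlap)   # record 2 shares 4 tokens with the query, record 10 shares 2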
  • Prefix Filter
    r: 100 Prairie Crossing Dr Chicago
    s: 100 Prairie Crossing Drive Chicago
    |r ∩ s| = 4; each string has 1 token the other lacks
    Any size-2 subset of r has non-empty overlap with s
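
    The slide's claim follows from a small piece of arithmetic, sketched here in my own formulation: if Jaccard(r, s) ≥ θ then |r ∩ s| ≥ θ·|r ∪ s| ≥ θ·|r|, so at most (1 - θ)·|r| tokens of r are missing from s, and any larger subset of r must overlap s.

        import math

        def prefix_len(r_size, theta):
            # At most floor((1 - theta) * |r|) tokens of r can be missing from s
            # when Jaccard(r, s) >= theta, so probing this many + 1 tokens of r
            # is guaranteed to hit at least one token that also occurs in s.
            return math.floor((1 - theta) * r_size) + 1

        # The slide's example: |r| = 5 tokens, Jaccard threshold 2/3.
        print(prefix_len(5, 2 / 3))   # 2: any size-2 subset of r overlaps s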
  • Inverted Index Based Approach
    Query: 100 Prairie Crossing Dr Chicago
    With the prefix filter, probe only the rid-lists of 100 and Prairie (record ids 2, 10, …) instead of all tokens
    The table still has 0.5 M rows, but far fewer record ids are touched
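
    Combining the two previous slides, a hedged sketch (my data and token ordering; the slide simply says to use 100 and Prairie): probe only the rid-lists of a small prefix of the query's tokens, ordered rarest first, then verify the few surviving candidates.

        from collections import defaultdict

        def jaccard(a, b):
            return len(a & b) / len(a | b)

        table = {2: "100 Prairie Crossing Drive Chicago",
                 10: "100 W Madison St Chicago",
                 11: "1 Main St Springfield"}
        index = defaultdict(list)
        for rid, text in table.items():
            for tok in set(text.split()):
                index[tok].append(rid)

        query = set("100 Prairie Crossing Dr Chicago".split())
        theta = 0.5
        # Keep a prefix of floor((1 - theta) * |query|) + 1 tokens, rarest first:
        # rare tokens have the shortest rid-lists, so the probe stays cheap.
        ordered = sorted(query, key=lambda t: len(index.get(t, [])))
        prefix = ordered[:int((1 - theta) * len(query)) + 1]

        candidates = {rid for tok in prefix for rid in index.get(tok, [])}
        matches = [rid for rid in candidates
                   if jaccard(query, set(table[rid].split())) >= theta]
        print(prefix, candidates, matches)   # only record 2 survives verification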
  • Signature based Indexing
    Use signature-based scheme to further reduce cost of indexing and index lookup
    Property: If two strings have high Jaccard similarity, then their signatures must intersect
    LSH signatures work well
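
    A minimal minhash/LSH sketch (the hashing scheme and parameters are my own choices, not necessarily what the production system uses): strings with high Jaccard similarity agree on most minhash values, so banding the signature produces index keys that are likely to collide for similar strings and unlikely to collide for dissimilar ones.

        import random

        def minhash_signature(tokens, seeds):
            # One minhash per seed: the token with the smallest seeded hash value.
            # Two sets agree on a given minhash with probability roughly their Jaccard similarity.
            return [min(tokens, key=lambda t: hash((seed, t))) for seed in seeds]

        def lsh_keys(signature, band_size):
            # Split the signature into bands; (band number, band contents) is one index key.
            return {(i // band_size, tuple(signature[i:i + band_size]))
                    for i in range(0, len(signature), band_size)}

        random.seed(7)
        seeds = [random.random() for _ in range(12)]

        a = set("100 Prairie Crossing Dr Chicago".split())
        b = set("100 Prairie Crossing Drive Chicago".split())
        sig_a, sig_b = minhash_signature(a, seeds), minhash_signature(b, seeds)
        # Jaccard(a, b) = 2/3, so about two thirds of the minhashes agree in expectation,
        # and some 3-wide band is likely (not guaranteed) to be shared by the two key sets.
        print(sum(x == y for x, y in zip(sig_a, sig_b)), "of", len(seeds), "minhashes agree")
        print(lsh_keys(sig_a, 3) & lsh_keys(sig_b, 3))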
  • Challenge 2: Lack of Customizability
  • Normalization?
    Normalization rule: Alan → A
    Alan Turing → A Turing
    Jaccard Similarity(A Turing, A Turing) = 1.0
  • Normalization?
    Normalization rules: Alan → A, Aaron → A
    Alan Turing → A Turing
    Aaron Turing → A Turing
    Jaccard Similarity(A Turing, A Turing) = 1.0 for both, so normalization alone cannot distinguish them
  • Transformations
    Transformation Rules: Xing → Crossing, W → West, Dr → Drive
    Programmable Similarity = Transformation Rules + Set Similarity
  • Semantics of Programmable Similarity
    Transformation Rules: Xing → Crossing, W → West, Dr → Drive
    Programmable Similarity = Transformation Rules + Set Similarity
    Input strings: Prairie Crossing Dr Chicago and Prairie Xing Dr Chicago
  • Semantics: Example
    Transformation Rules: Xing → Crossing, W → West, Dr → Drive
    Prairie Crossing Dr Chicago generates: Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago
    Prairie Xing Dr Chicago generates: Prairie Xing Dr Chicago, Prairie Xing Drive Chicago, Prairie Crossing Dr Chicago, Prairie Crossing Drive Chicago
    Similarity = maximum Jaccard over the generated pairs = 1.0
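
    To make these semantics concrete, here is a brute-force sketch (mine; only practical for tiny rule sets, which is exactly the blowup the next slides discuss): expand each string into all variants reachable through the rules, then take the maximum Jaccard similarity over pairs of variants.

        from itertools import product

        RULES = {"Xing": ["Xing", "Crossing"], "W": ["W", "West"], "Dr": ["Dr", "Drive"]}

        def variants(s):
            # Each token expands to itself plus its rule alternatives; combine all choices.
            options = [RULES.get(tok, [tok]) for tok in s.split()]
            return {" ".join(choice) for choice in product(*options)}

        def jaccard(a, b):
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / len(sa | sb)

        s1, s2 = "Prairie Crossing Dr Chicago", "Prairie Xing Dr Chicago"
        print(sorted(variants(s2)))   # 4 variants, including 'Prairie Crossing Drive Chicago'
        best = max(jaccard(v1, v2) for v1 in variants(s1) for v2 in variants(s2))
        print(best)                   # 1.0: the strings match under the transformations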
  • Source of Transformations
    Domain-specific authorities
    ~200,000 rules from USPS for address matching
    Hard to capture using a black-box similarity function
    Web
    Wikipedia redirects
    Program
    First → 1st, Second → 2nd
  • Computational Challenge: Blowup
    ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA
    ATT: ATT | American Telephone and Telegraph
    Corp: Corp | Corporation
    100: 100 | One Hundred | Hundred | Door 100
    Xing: Xing | Crossing
    Dr: Dr | Drive
    IL: IL | Illinois
    USA: USA | United States | United States of America
    384 variations!
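
    A quick arithmetic check of that figure (alternative counts read off the slide): the number of generated variants is the product of the per-token alternative counts.

        from math import prod

        # Alternatives per token of "ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA"
        alternatives = {"ATT": 2, "Corp": 2, "100": 4, "Prairie": 1, "Xing": 2,
                        "Dr": 2, "Chicago": 1, "IL": 2, "USA": 3}
        print(prod(alternatives.values()))   # 2*2*4*2*2*2*3 = 384 variations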
  • Similarity With Transformations: Bipartite Matching
    Prairie Xing Dr Chicago vs. Prairie Crossing Drive Chicago
    Rules: Xing → Crossing, W → West, Dr → Drive
    Build a bipartite graph on the tokens, with an edge between equal tokens or tokens related by a rule
    Max Intersection = Max Matching = 4
    Max Jaccard = Max Intersection / (8 - Max Intersection) = 4/4 = 1
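
    The blowup can be sidestepped: similarity with transformations reduces to a maximum bipartite matching between the two token lists, with an edge between equal tokens or tokens related by a rule. A small sketch with my own augmenting-path matcher (illustrative, not optimized):

        RULES = {("Xing", "Crossing"), ("W", "West"), ("Dr", "Drive")}

        def related(a, b):
            return a == b or (a, b) in RULES or (b, a) in RULES

        def max_matching(left, right):
            # Kuhn's augmenting-path algorithm for maximum bipartite matching.
            match_r = {}                      # right index -> matched left index
            def try_assign(i, seen):
                for j, rt in enumerate(right):
                    if related(left[i], rt) and j not in seen:
                        seen.add(j)
                        if j not in match_r or try_assign(match_r[j], seen):
                            match_r[j] = i
                            return True
                return False
            return sum(try_assign(i, set()) for i in range(len(left)))

        r = "Prairie Xing Dr Chicago".split()
        s = "Prairie Crossing Drive Chicago".split()
        inter = max_matching(r, s)                       # 4
        print(inter, inter / (len(r) + len(s) - inter))  # Max Jaccard = 4 / (8 - 4) = 1.0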
  • Extensions to Signature based Indexing
    Use same LSH signature-based scheme to reduce cost of indexing and index lookup
    Two Properties:
    If two strings have high Jaccard similarity, then signatures must intersect
    All LSH signatures corresponding to generated strings can be obtained efficiently without materializing
  • Challenge of Setting Thresholds
    Operator tree: R(Address) is run through Parse Address to get R(St, City, State, Zip), which is joined with S(St, City, State, Zip) by two Similarity Joins, one on (St, City) and one on (St, State, Zip), with thresholds 0.9 and 0.7, followed by a Union
    Each join carries its own transformation tables, e.g. WA → Washington, WI → Wisconsin, FL → Florida, Xing → Crossing, W → West, Dr → Drive
    What are the "right" thresholds?
  • Learning From Examples
    Input
    • A set of examples: matches & non-matches
    • An operator tree invoking (multiple) Similarity Join operations
    Goal
    • Set the thresholds (number of thresholds = number of join columns) such that
    • Precision threshold: the number of false positives is less than B
    • Recall is maximized: the number of correctly classified matching pairs
    Can be generalized to also choose the joining columns and similarity functions (a small sketch follows below)
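
    A minimal sketch of the learning step as I read this slide (the published algorithm is more sophisticated): grid-search one threshold per join column, discard settings whose false positives reach the budget B, and keep the setting that classifies the most matching pairs correctly. The interface below is hypothetical.

        from itertools import product

        def pick_thresholds(examples, score_fns, budget_B, grid=(0.5, 0.6, 0.7, 0.8, 0.9)):
            # examples: list of ((record_a, record_b), is_match); score_fns: one similarity
            # function per join column. A pair is predicted a match only if every column's
            # score meets its threshold.
            best, best_recall = None, -1
            for thresholds in product(grid, repeat=len(score_fns)):
                fp = tp = 0
                for pair, is_match in examples:
                    pred = all(f(*pair) >= t for f, t in zip(score_fns, thresholds))
                    fp += pred and not is_match
                    tp += pred and is_match
                if fp < budget_B and tp > best_recall:
                    best, best_recall = thresholds, tp
            return best, best_recall

        def jacc(a, b):
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / len(sa | sb)

        examples = [(("Prairie Crossing Dr", "Prairie Crossing Drive"), True),
                    (("Main St Seattle", "Elm Ave Chicago"), False)]
        print(pick_thresholds(examples, [jacc], budget_B=1))   # ((0.5,), 1)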
  • Outline
    Introduction and Motivation
    Two Challenges in Record Matching
    Concluding Remarks
  • Real-World Record Matching Task
    Katrina: Given evacuee lists, match against enquiries
  • Beyond Enterprise Data
    The Canon EOS Rebel XTi remains a very good first dSLR…
    The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…
    New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…
    Documents
    Challenge: Pairwise Matching
  • Final Thoughts
    Goal: Make Application building easier
    Customizability; Efficiency
    Internal Impact of MSR’s Record Matching
    SQL Server Integration Services; Relationship Discovery in Excel PowerPivot
    Bing Maps, Bing Shopping
    Open Issues
    Design Studio for Record Matching
    Record Matching for Web Scale Problems
    Broader use of Feature engineering techniques
  • Questions?
  • References
    Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB 2009
    Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD 2009
    Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in IEEE ICDE 2009
    Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in IEEE ICDE 2008
    Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB 2007
    Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik, Leveraging Aggregate Constraints for Deduplication, in SIGMOD 2007
    Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani, Robust Identification of Fuzzy Duplicates, in IEEE ICDE 2005: 865-876
    Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, in SIGMOD 2003
  • Appendix: “Robust Identification of Fuzzy Duplicates” (IEEE Data Engineering, 2005)
  • Deduplication
    Given a relation R, the goal is to partition R into groups such that each group consists of “duplicates” (of the same entity)
    Also called reference reconciliation, entity resolution, merge/purge
    Record matching / record linkage (identifying record pairs across relations which are duplicates) is an important sub-goal of deduplication
  • Previous Techniques
    Distance functions to abstract closeness between tuples
    E.g., edit distance, cosine similarity, etc.
    Approach 1: clustering
    Hard to determine number of clusters
    Approach 2: partition into “valid” groups
    Global threshold g
    All pairs of tuples whose distance < g are considered duplicates
    Partitioning
    Connected components in the threshold graph
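
    To illustrate the partitioning step, here is a tiny sketch (data invented) that forms the connected components of the threshold graph with union-find; the second call shows how a single global threshold can chain everything into one group, which motivates the criteria on the following slides.

        def threshold_components(points, g):
            # Union-find over tuples; join every pair whose distance is below g.
            parent = list(range(len(points)))
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x
            for i in range(len(points)):
                for j in range(i + 1, len(points)):
                    if abs(points[i] - points[j]) < g:
                        parent[find(i)] = find(j)
            groups = {}
            for i in range(len(points)):
                groups.setdefault(find(i), []).append(points[i])
            return list(groups.values())

        pts = [1, 2, 3, 6, 7, 10, 11, 12]
        print(threshold_components(pts, 1.5))   # [[1, 2, 3], [6, 7], [10, 11, 12]]
        print(threshold_components(pts, 3.5))   # one big chained group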
  • Our Approach
    Local structural properties are important for identifying sets of duplicates
    Identify two criteria to characterize local structural properties
    Formalize the duplicate elimination problem based upon these criteria
    Unique solution, rich space of solutions, impact of distance transformations, etc.
    Propose an algorithm for solving the problem
  • Compact Set (CS) Criterion
    Duplicates are closer to each other than to other tuples
    A group is compact if it consists of all mutual nearest neighbors
    In {1,2,3,6,7,10,11,12}: {1,2,3}, {6,7}, {10,11,12} are compact groups
    Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets
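
    One plausible reading of the criterion, checked on the slide's 1-D example (code and interpretation are mine): a group G is compact if, for every tuple in G, its |G| - 1 nearest neighbors within the whole relation are exactly the other members of G.

        def is_compact(group, relation):
            k = len(group) - 1
            for v in group:
                # k nearest neighbors of v among all other tuples (1-D distance here).
                others = sorted((p for p in relation if p != v), key=lambda p: abs(p - v))
                if set(others[:k]) != set(group) - {v}:
                    return False
            return True

        R = [1, 2, 3, 6, 7, 10, 11, 12]
        for G in ([1, 2, 3], [6, 7], [10, 11, 12], [3, 6]):
            print(G, is_compact(G, R))   # the first three are compact, [3, 6] is not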
  • Sparse Neighborhood (SN) Criterion
    [Figure: growth spheres of radius nn(v) and 2∙nn(v) around v, where nn(v) is the nearest-neighbor distance of v]
    Duplicate tuples are well-separated from other tuples
    Neighborhood is “sparse”
    ng(v) = #tuples in larger sphere / #tuples in smaller sphere around v
    ng(set S of tuples) = AGG{ng(v) of each v in S}
    S is sparse if ng(S) < c
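
    A companion sketch for the same 1-D example (again my own reading of the definitions; AGG is taken to be the average and the constant c is picked arbitrarily for the demo):

        def ng(v, relation):
            # nn(v): distance to v's nearest neighbor; ng(v): #tuples within 2*nn(v)
            # divided by #tuples within nn(v).
            dists = sorted(abs(p - v) for p in relation if p != v)
            nn = dists[0]
            return sum(d <= 2 * nn for d in dists) / sum(d <= nn for d in dists)

        def ng_group(S, relation):
            return sum(ng(v, relation) for v in S) / len(S)   # AGG = average (assumed)

        R = [1, 2, 3, 6, 7, 10, 11, 12]
        c = 2.5
        for S in ([1, 2, 3], [6, 7], [10, 11, 12]):
            print(S, ng_group(S, R), ng_group(S, R) < c)   # sparse if ng(S) < c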
  • Other Constraints
    Goal: Partition R into the minimum number of groups {G1,…,Gm} such that for all 1 ≤ i ≤ m
    Gi is a compact set and Gi is an SN group
    Can lead to unintuitive solutions
    {101, 102, 104, 201, 202, 301, 302} – 1 group!
    Size constraint: size of a group of duplicates is less than K
    Diameter constraint: diameter of a group of duplicates is less than θ