Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Data Cleaning andTransformation – RecordLinkage Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004) and “Searching and Integrating Information on the Web, Seminar 3”, Chen Li, Univ. Irvine)
  2. 2. Agenda Motivation Introduction Details on specific steps  Pre-processing  Comparison functions  Types of blocking methods  Sorted Neighborhood Method  Approximate String Joins over RDBMS
  3. 3. Motivation Correlate data from different data sources  Data is often dirty  Needs to be cleansed before being used Example:  A hospital needs to merge patient records from different data sources  They have different formats, typos, and abbreviations
  4. 4. Example Table R Table S Name SSN Addr Name SSN AddrJack Lemmon 430-871-8294 Maple St Ton Hanks 234-162-1234 Main StreetHarrison Ford 292-918-2913 Culver Blvd Kevin Spacey 928-184-2813 Frost BlvdTom Hanks 234-762-1234 Main St Jack Lemon 430-817-8294 Maple Street … … … … … …  Find records from different datasets that could be the same entity
  5. 5. Another Example P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
  6. 6. Type of data considered (Two) formatted tables  Homogeneous in common attributes  Heterogeneous  Format (Full name vs Acronym)  Semantics (e.g. Age vs Date of Birth)  Errors (Two) groups of tables  Dimensional hierarchy (Two) XML documents
  7. 7. Record linkage and its evolution towardsobject identification in databases …. Record linkage First record and second record represent the same aspect of reality? Crlo Batini 55 Carlo Btini 54 Object identification in databases First group of records and second group of records represent the same aspect of reality? Crlo Batini Pescara Pscara Itly Europe Carlo Batini Pscara Pscara Italy Europe
  8. 8. …and object identification insemistructured documents <country> <name> United States of America </name> <cities> New York, Los Angeles, Chicago </cities> <lakes> <name> Lake Michigan </name> </lakes> </country> and <country> United States <city> New York </city> are the same <city> Los Angeles </city> object? <lakes> <lake> Lake Michigan </lake> </lakes>
  9. 9. Record linkageProblem statement:“Given two relations, identify the potentially matched records  Efficiently and  Effectively” Also known as: approximated duplicate detection, entity resolution/identification, etc
  10. 10. Challenges How to define good similarity functions?  Many functions proposed (edit distance, cosine similarity, …)  Domain knowledge is critical  Names: “Wall Street Journal” and “LA Times”  Address: “Main Street” versus “Main St”  Record-oriented: A pair of records with different fields is considered How to do matching efficiently  Offline join version  Online (interactive) search  Nearest search  Range search  Record-set oriented: A potentially large set (or two sets )of records needs to be compared
  11. 11. Introduction
  12. 12. Relevant steps of record linkage techs. AssignmentInputfiles Initial Search space Match Reduced search A Space S’ Search Decision Possible AxB space S’ A x B model matched reduction B Unmatch
  13. 13. General strategy0. Preprocessing  Standardize fields to compare and correct simple errors1. Estabilish a search space reduction method (also called blocking method)  Given the search space S = A x B of the two files, find a new search space S’ contained in S, to apply further steps.2. Choose comparison function  Choose the function/set of rules that express the distance between pairs of records in S’3. Choose decision model  Choose the method for assigning pairs in S’ to M, the set of matching records, U the set of unmatching records, and P the set of possible matches4. Check effectiveness of method
  14. 14. Phases - [Preprocessing] - Estabilish a blocking method - Compute comparison function - Apply decision model - [Check effectiveness of method]
  15. 15. Details on specific steps
  16. 16. Data preprocessing (preparation) Parsing: locates, identifies, and isolates individual fields in the source files  Ex: parsing of name and address components into consistent packets of information Data transformation: simple conversions applied to te data in order for them to conform to the data types of the corresponding domains  Ex: data type conversions, range checking Data standardization: normalization of the information represented in certain fields to a specific content format  Ex: address information, date and time formatting, name and title formatting
  17. 17. Comparison functions It is not easy to match duplicate records even after parsing, data standardization, and identification of similar fields  Misspellings and different conventions for recording the same information still result in different, multiple representations of a unique object in the database.Field matching techniques: for measuring the similarity of individual fieldsDetection of duplicate records: for measuring the similarity of entire records
  18. 18. (String) field matching functionsCharacter-based similarity metrics: handle typographical errors wellToken-based similarity metrics: handle typographical conventions that lead to rearrangement of words  Ex: “John Smith” vs “Smith, John”Phonetic similarity measures: handle similarity between strings that are not similar at a character or token level  Ex: “Kageonne” and “Cajun”
  19. 19. Character-based similarity metrics (1) Hamming distance - counts the number of mismatches between two numbers or text strings (fixed length strings) Ex: The Hamming distance between “00185” and “00155” is 1, because there is one mismatch Edit distance - the minimum cost to convert one of the strings to the other by a sequence of character insertions, deletions and replacements. Each one of these modifications is assigned a cost value. Ex: the edit distance between “Smith” and “Sitch” is 2, since “Smith” is obtained by adding “m” and deleting “c” from “Sitch”. Several variations/improvements: Affine Gap Distance, Smith- Waterman distance, etc
  20. 20. Character-based similarity metrics (2)Jaro’s algorithm or distance finds the number of common characters and the number of transposed characters in the two strings. Jaro distance accounts for insertions, deletions, and transpositions. Common characters are all characters s1[i] and s2[j], for which s1[i] =s2[j] and |i-j|<= 1/2min(|s1|, |s2|) , which means characters that appear in both strings within a distance of half the length of the shorter string. A transposed character is a common character that appears in different positions.Ex: Comparing “Smith” and “Simth”, there are five common characters, two of which are transposedJaro(s1, s2) = 1/3(Nc/|s1|+Nc/|s2|+(Nc-Nt/2)/Nc)Where Nc is the nb of comon characters between the two strings and Nt is the number of transpositions
  21. 21. Character-based similarity metrics (3) N-grams comparison function forms a vector as the set of all the substrings of length n for each string. The n- gram representation of a string has a non-zero component vector for each n-letter substring it contains, where the magnitude is equal to the number of times the substring occurs. The distance between the two strings is defined as the square radix of the distance between the two vectors.
  22. 22. Token-based similarity metrics (1)TF-IDF (Token Frequency-Inverse Document Frequency) or cosine similarity assigns higher weights to tokens appearing frequently in a document (TF weight) and lower weights to tokens that appear frequently in the whole set of documents (IDF weight). TFw is the number of times that w appears in the field IDFw = |D|/nw where nw is the number of records in the database D that contain the word w Separates each string s into words and assign a weight to each word w: vs(w) = log(TFw+1)*log(IDFw)The cosine similarity of s1 and s2 is: sim(s1, s2) = j=1|D| vs1(j)*vs2(j)/||vs1||*||vs2||
  23. 23. Token-based similarity metrics (2) Widely used for matching strings in documents The cosine similarity works well for a large variety of entries and is insensitive to the location of words  Ex: “John Smith” is equivalent to “Smith, John” The introduction of frequent words only minimally affects the similarity of two strings  Ex: “John Smith” and “Mr John Smith” have similarity close to one. Does not capture word spelling errors  Ex: “Compter Science Departament” and “Deprtment of Computer Scence” wil have similarity equal to zero
  24. 24. Phonetic similarity metricsSoundex: the most common phonetic coding scheme. Based on the assignment of identical code digits to phonetically similar groups of consonants and is used mainly to match surnames.
  25. 25. Detection of duplicate records Two categories of methods for matching records with multiple fields:  Approaches that rely on training data to learn how to match the records  Some probabilistic approaches  Ex: [Fellegi and Sunter 67]  Supervised machine learning techniques  Ex: [Bilenko et al 03]  Approaches that rely on domain knowledge or on generic distance metrics to match records  Approaches that use declarative languages for matching  Ex: AJAX  Approaches that devise distance metrics suitable for duplicate detection  Ex: [Monge and Elkan 96]
  26. 26. General strategy0. Preprocessing  Standardize fields to compare and correct simple errors 1. Estabilish a search space reduction method (also called blocking method)  Given the search space S = A x B of the two files, find a new search space S’ contained in S, to apply further steps.2. Choose comparison function  Choose the function/set of rules that express the distance between pairs of records in S’3. Choose decision model  Choose the method for assigning pairs in S’ to M, the set of matching records, U the set of unmatching records, and P the set of possible matches4. Check effectiveness of method
  27. 27. Types of Blocking/Searching methods (1) Blocking - partition of the file into mutually exclusive blocks.  Comparisons are restricted to records within each block. Blocking can be implemented by:  Sorting the file according to a block key (chosen by an expert).  Hashing - a record is hashed according to its block key in a hash block. Only records in the same hash block are considered for comparison.
  28. 28. Types of Blocking/Searching methods(2) Sorted neighbour or Windowing – moving a window of a specific size w over the data file, comparing only the records that belong to this window Multiple windowing - Several scans, each of which uses a different sorting key may be applied to increase the possibility of combining matched records Windowing with choice of key based on the quality of the key
  29. 29. Sorted Neighbour searching method1. Create Key: Compute a key K for each record in the list by extracting relevant fields or portions of fields.  Relevance is decided by experts.2. Sort Data: Sort the records in the data list using K3. Merge: Move a fixed size window through the sequential list of records limiting the comparisons for matching records to those records in the window. If the size of the window is w records, then every new record entering the window is compared with the previous w – 1 records to find “matching” records
  30. 30. SNM: Create key Compute a key for each record by extracting relevant fields or portions of fields Example: First Last Address ID Key Sal Stolfo 123 First Street 45678987 STLSAL123FRST456
  31. 31. SNM: Sort Data Sort the records in the data list using the key in step 1 This can be very time consuming  O(NlogN) for a good algorithm,  O(N2) for a bad algorithm
  32. 32. SNM: Merge records Move a fixed size window through the sequential list of records. This limits the comparisons to the records in the window
  33. 33. SNM: Considerations What is the optimal window size while  Maximizing accuracy  Minimizing computational cost Execution time for large DB will be bound by  Disk I/O  Number of passes over the data set
  34. 34. Selection of Keys The effectiveness of the SNM highly depends on the key selected to sort the records A key is defined to be a sequence of a subset of attributes Keys must provide sufficient discriminating power
  35. 35. Example of Records and Keys First Last Address ID Key Sal Stolfo 123 First Street 45678987 STLSAL123FRST456 Sal Stolfo 123 First Street 45678987 STLSAL123FRST456 Sal Stolpho 123 First Street 45678987 STLSAL123FRST456 Sal Stiles 123 Forest Street 45654321 STLSAL123FRST456
  36. 36. Equational Theory The comparison during the merge phase is an inferential process Compares much more information than simply the key The more information there is, the better inferences can be made
  37. 37. Equational Theory - Example Two names are spelled nearly identically and have the same address  It may be inferred that they are the same person Two social security numbers are the same but the names and addresses are totally different  Could be the same person who moved  Could be two different people and there is an error in the social security number
  38. 38. A simplified rule in EnglishGiven two records, r1 and r2IF the last name of r1 equals the last name of r2,AND the first names differ slightly,AND the address of r1 equals the address of r2THEN r1 is equivalent to r2
  39. 39. The distance function A “distance function” is used to compare pieces of data (usually text) Apply “distance function” to data that “differ slightly” Select a threshold to capture obvious typographical errors.  Impacts number of successful matches and number of false positives
  40. 40. Examples of matched records SSN Name (First, Initial, Last) Address334600443 Lisa Boardman 144 Wars St.334600443 Lisa Brown 144 Ward St.525520001 Ramon Bonilla 38 Ward St.525520001 Raymond Bonilla 38 Ward St. 0 Diana D. Ambrosion 40 Brik Church Av. 0 Diana A. Dambrosion 40 Brick Church Av.789912345 Kathi Kason 48 North St.879912345 Kathy Kason 48 North St.879912345 Kathy Kason 48 North St.879912345 Kathy Smith 48 North St.
  41. 41. Building an equational theory The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system In complex problems, an expert’s assistance is needed to write the equational theory
  42. 42. Transitive Closure In general, no single pass (i.e. no single key) will be sufficient to catch all matching records An attribute that appears first in the key has higher discriminating power than those appearing after them  If an employee has two records in a DB with SSN 193456782 and 913456782, it’s unlikely they will fall under the same window
  43. 43. Transitive Closure To increase the number of similar records merged  Widen the scanning window size, w  Execute several independent runs of the SNM  Use a different key each time  Use a relatively small window  Call this the Multi-Pass approach
  44. 44. Multi-pass approach Each independent run of the Multi-Pass approach will produce a set of pairs of records Although one field in a record may be in error, another field may not Transitive closure can be applied to those pairs to be merged
  45. 45. Multi-pass MatchesPass 1 (Lastname discriminates)KSNKAT48NRTH789 (Kathi Kason 789912345 )KSNKAT48NRTH879 (Kathy Kason 879912345 )Pass 2 (Firstname discriminates)KATKSN48NRTH789 (Kathi Kason 789912345 )KATKSN48NRTH879 (Kathy Kason 879912345 )Pass 3 (Address discriminates)48NRTH879KSNKAT (Kathy Kason 879912345 )48NRTH879SMTKAT (Kathy Smith 879912345 )
  46. 46. Transitive Equality ExampleIF A implies B AND B implies C THEN A implies CFrom example:789912345 Kathi Kason 48 North St. (A)879912345 Kathy Kason 48 North St. (B)879912345 Kathy Smith 48 North St. (C)
  47. 47. Another approach: Approximate String Joins [Gravano 2001] We want to join tuples with “similar” string fields Similarity measure: Edit Distance Each Insertion, Deletion, Replacement increases distance by one Service A Service B Jenny Stamatopoulou Panos Ipirotis K=1 John Paul McDougal Jonh Smith K=2 Aldridge Rodriguez … Panos Ipeirotis Jenny Stamatopulou K=1 John Smith John P. McDougal K=3 … … … Al Dridge Rodriguez K=1
  48. 48. Focus: Approximate String Joinsover Relational DBMSs Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K Solve the problem in a database-friendly way (if possible with an existing RDBMS)
  49. 49. Current Approaches for ProcessingApproximate String JoinsNo native support for approximate joins in RDBMSsTwo existing (straightforward) solutions: Join data outside of DBMS Join data via user-defined functions (UDFs) inside the DBMS
  50. 50. Approximate String Joins outside of a DBMS1. Export data2. Join outside of DBMS3. Import the resultMain advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionalityDisadvantages: Substantial amounts of data to be exported/imported Cannot be easily integrated with further processing steps in the DBMS
  51. 51. Approximate String Joins with UDFs1. Write a UDF to check if two strings match within distance K2. Write an SQL statement that applies the UDF to the string pairsSELECT R.stringAttr, S.stringAttrFROM R, SWHERE edit_distance(R.stringAttr, S.stringAttr, K)Main advantage: Ease of implementationMain disadvantage: UDF applied to entire cross-product of relations
  52. 52. Our Approach: Approximate String Joins over an Unmodified RDBMS1. Preprocess data and generate auxiliary tables2. Perform join exploiting standard RDBMS capabilitiesAdvantages No modification of underlying RDBMS needed. Can leverage the RDBMS query optimizer. Much more efficient than the approach based on naive UDFs
  53. 53. Intuition and Roadmap Intuition:  Similar strings have many common substrings  Use exact joins to perform approximate joins (current DBMSs are good for exact joins)  A good candidate set can be verified for false positives Roadmap:  Break strings into substrings of length q (q-grams)  Perform an exact join on the q-grams  Find candidate string pairs based on the results  Check only candidate pairs with a UDF to obtain final answer
  54. 54. What is a “Q-gram”? Q-gram: A sequence of q characters of the original string Example for q=3 vacations {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$} String with length L → L + q - 1 q-grams Similar strings have a many common q-grams
  55. 55. Q-grams and Edit Distance Operations  With no edits: L + q - 1 common q-grams  Replacement: (L + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$}  Insertion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$}  Deletion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$}
  56. 56. Number of Common Q-grams andEdit Distance For Edit Distance = K, there could be at most K replacements, insertions, deletions Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
  57. 57. Using a DBMS for Q-gram Joins If we have the q-grams in the DBMS, we can perform this counting efficiently. Create auxiliary tables with tuples of the form: <sid, qgram> and join these tables A GROUP BY – HAVING COUNT clause can perform the counting / filtering
  58. 58. Eliminating Candidate Pairs: COUNT FILTERINGSQL for this filter: (parts omitted for clarity)SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgramGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*qThe result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives.
  59. 59. Eliminating Candidate Pairs Further: LENGTH FILTERINGStrings with length difference larger than K cannot be within Edit Distance KSELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=KGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as LENGTH FILTERING
  60. 60. Exploiting Q-gram Positions for Filtering  Consider strings aabbzzaacczz and aacczzaabbzz  Strings are at edit distance 4  Strings have identical q-grams for q=3 Problem: Matching q-grams that are at different positions in both strings  Either q-grams do not "originate" from same q-gram, or  Too many edit operations "caused" spurious q-grams at various parts of strings to match
  61. 61. Example of positional q-gramsPositional q-grams of length q=3 for john_smith{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_s), (7,_sm), (8,smi), (9,mit), (10, ith), (11,th$), (12, h$$)}Positional q-grams of length q=3 for john_a_smith{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_a), (7,_a_), (8,a_s), (9,_sm), (10,smi), (11,mit), (12,ith), (13,th$), (14,h$$)} Ignoring positional information, the two strings have 11 n-grams in common Only the five first positional q-grams are common in both strings An additional six positional q-grams in the two strings differ in their position by just two positions
  62. 62. POSITION FILTERING - Filtering using positions Keep the position of the q-grams <sid, strlen, pos, qgram> Do not match q-grams that are more than K positions awaySELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K AND abs(R.pos - S.pos)<=KGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as POSITION FILTERING
  63. 63. The Actual, Complete SQL Statement SELECT R1.string, S1.string, R1.sid, S1.sid FROM R1, S1, R, S, WHERE R1.sid=R.sid AND S1.sid=S.sid AND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=K AND abs(R.pos - S.pos)<=K GROUP BY R1.sid, S1.sid, R1.string, S1.string HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q
  64. 64. References “Data Quality: Concepts, Methodologies and Techniques”, C. Batini and M. Scannapieco, Springer-Verlag, 2006. “Duplicate Record Detection: A Survey”, A. Elmagarmid, P. Ipeirotis, V. Verykios, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, no. 1, Jan. 2007. Approximate String Joins in a Database (Almost) for Free, Gravano et al, VLDB 2001, Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995