
A Primer on Entity Resolution

Entity Resolution is the task of disambiguating manifestations of real world entities through linking and grouping, and is often an essential part of the data wrangling process. There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization, each of which serves to improve data quality by reducing irrelevant or repeated data, joining information from disparate records, and providing a single source of information to perform analytics upon. However, due to data quality issues (misspellings or incorrect data), schema variations in different sources, or simply different representations, entity resolution is not a straightforward process, and most ER techniques utilize machine learning and other stochastic approaches.

Published in: Data & Analytics


  1. 1. A Primer on Entity Resolution
  2. 2. Workshop Objectives ● Introduce entity resolution theory and tasks ● Similarity scores and similarity vectors ● Pairwise matching with the Fellegi Sunter algorithm ● Clustering and Blocking for deduplication ● Final notes on entity resolution
  3. 3. Entity Resolution Theory
  4. 4. Entity Resolution refers to techniques that identify, group, and link digital mentions or manifestations of some object in the real world.
  5. 5. In the Data Science Pipeline, ER is generally a wrangling technique. (Diagram: a pipeline spanning Interaction, Data, Storage, and Computation layers, with stages for Ingestion, Wrangling, Normalization, a Computational Data Store, Feature Analysis, Model Builds with Cross Validation, Model Selection & Monitoring, an API, and a Feedback loop.)
  6. 6. Information Quality - Creation of high quality data sets - Reduction in the number of instances in machine learning models - Reduction in the amount of covariance, and therefore collinearity, of predictor variables - Simplification of relationships
  7. 7. Graph Analysis Simplification and Connection (Diagram: email address nodes such as ben@ddl.com, ben@gmail.com, tony@ddl.com, tony@gmail.com, selma@gmail.com, allen@acme.com, and rebecca@acme.com resolved to the entities Ben, Rebecca, Allen, Tony, and Selma.)
  8. 8. Machine Learning and ER - Heterogeneous data: unstructured records - Larger and more varied datasets - Multi-domain and multi-relational data - Varied applications (web and mobile) Parallel, Probabilistic Methods Required* * Although this is often debated in various related domains.
  9. 9. Entity Resolution Tasks Deduplication Record Linkage Canonicalization Referencing
  10. 10. - Primary consideration in ER - Cluster records that correspond to the same real world entity, normalizing a schema - Reduces number of records in the dataset - Variant: compute cluster representative Deduplication
  11. 11. Record Linkage - Match records from one deduplicated data store to another (bipartite) - K-partite linkage links records in multiple data stores and their various associations - Generally proposed in relational data stores, but more frequently applied to unstructured records from various sources.
  12. 12. Referencing - Also known as entity disambiguation - Match noisy records to a clean, deduplicated reference table that is already canonicalized - Generally used to atomize multiple records to some primary key and donate extra information to the record
  13. 13. Canonicalization - Compute representative - Generally the “most complete” record - Imputation of missing attributes via merging - Attribute selection based on the most likely candidate for downstream matching
  14. 14. Notation - R: set of records - M: set of matches - N: set of non-matches - E: set of entities - L: set of links Compare (Mt ,Nt ,Et ,Lt )⇔(Mp ,Np ,Ep ,Lp ) - t = true, p = predicted
  15. 15. Key Assumptions - Every entity refers to a real world object (e.g. there are no “fake” instances) - References or sources (for record linkage) include no duplicates (integrity constraints) - If two records are identical, they are true matches: (x, x) ∈ M_t
  16. 16. - NLTK: natural language toolkit - Dedupe*: structured deduplication - Distance: C implemented distance metrics - Scikit-Learn: machine learning models - Fuzzywuzzy: fuzzy string matching - PyBloom: probabilistic set matching Tools for Entity Resolution
  17. 17. Similarity
  18. 18. At the heart of any entity resolution task is the computation of similarity or distance.
  19. 19. For two records, x and y, compute a similarity vector for each component attribute: [match_score(attr_x, attr_y) for attr_x, attr_y in zip(x, y)] where match_score is a per-attribute function that computes either a boolean (match, not match) or real valued distance score: match_score ∈ [0,1]*
  20. 20. x = { 'id': 'b0000c7fpt', 'title': 'reel deal casino shuffle master edition', 'description': 'reel deal casino shuffle master edition is ...', 'manufacturer': 'phantom efx', 'price': 19.99, 'source': 'amazon', } y = { 'id': '17175991674191849246', 'name': 'phantom efx reel deal casino shuffle master edition', 'description': 'reel deal casino shuffle master ed. is ...', 'manufacturer': None, 'price': 17.24, 'source': 'google', }
  21. 21. # similarity vector is a match score of: # [name_score, description_score, manufacturer_score, price_score] # Boolean Match similarity(x,y) == [0, 1, 0, 0] # Real Valued Match similarity(x,y) == [0.83, 1.0, 0, 2.75]
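A minimal sketch of such a similarity function, assuming the standard library's difflib for the real-valued string scores (the metrics table on the next slide lists the more principled choices), with the price component left as an absolute difference, as in the 2.75 above:

```python
from difflib import SequenceMatcher

def string_score(a, b):
    """Real-valued string similarity in [0, 1]; 0.0 if either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity(x, y):
    """Similarity vector of [name, description, manufacturer, price] scores
    for two product records shaped like the Amazon/Google examples above."""
    return [
        string_score(x.get('title') or x.get('name'), y.get('title') or y.get('name')),
        string_score(x.get('description'), y.get('description')),
        string_score(x.get('manufacturer'), y.get('manufacturer')),
        abs((x.get('price') or 0) - (y.get('price') or 0)),  # raw numeric distance
    ]
```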
  22. 22. Match Scores Reference - String Matching: Edit Distance (Levenshtein, Smith-Waterman, Affine Alignment, Jaro-Winkler, Soft-TFIDF, Monge-Elkan), Phonetic (Soundex, Translation) - Distance Metrics: Euclidean, Manhattan, Minkowski; Text Analytics (Jaccard, TFIDF, Cosine similarity); Set Based (Dice, Tanimoto (Jaccard), Common Neighbors, Adar Weighted) - Relational Matching: Aggregates (Average values, Max/Min values, Medians, Frequency (Mode)) - Other Matching: Numeric distance, Boolean equality, Fuzzy matching, Domain specific, Gazettes (Lexical matching, Named Entities (NER))
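Two of the metrics in the reference can be sketched in a few lines of plain Python; these are illustrative implementations, not the optimized ones in the Distance or Fuzzywuzzy packages:

```python
def jaccard(a, b):
    """Jaccard similarity of two strings as token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a.lower().split()), set(b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def levenshtein(s, t):
    """Levenshtein edit distance: the minimum number of insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]
```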
  23. 23. Fellegi Sunter
  24. 24. Pairwise Matching: Given a vector of attribute match scores for a pair of records (x, y), compute P_match(x, y).
  25. 25. Weighted Sum + Threshold P_match = sum(weight * score for weight, score in zip(weights, vector)) - weights should sum to one - determine weight for each attribute match score - higher weights for more predictive features - e.g. email more predictive than username - attribute value also contributes to predictability - If the weighted score > threshold then match.
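The weighted-sum matcher above is a one-liner; the weights and threshold here are hypothetical values, which in practice would be tuned per attribute:

```python
def weighted_match(vector, weights, threshold=0.8):
    """Weighted-sum pairwise matcher: weights should sum to one, with
    higher weights on more predictive attributes (e.g. email > username).
    Returns True when the weighted score exceeds the threshold."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to one"
    score = sum(weight * s for weight, s in zip(weights, vector))
    return score > threshold
```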
  26. 26. Rule Based Approach - Formulate rules about the construction of a match for attribute collections. if score_name > 0.75 && score_price > 0.6 - Although formulating rules is hard, domain specific rules can be applied, making this a typical approach for many applications.
  27. 27. Modern record linkage theory was formalized in 1969 by Ivan Fellegi and Alan Sunter who proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent. Their pioneering work “A Theory For Record Linkage” remains the mathematical foundation for many record linkage applications even today. Fellegi, Ivan P., and Alan B. Sunter. "A theory for record linkage." Journal of the American Statistical Association 64.328 (1969): 1183-1210.
  28. 28. Record Linkage Model For two record sets, A and B, and a record pair r = (a, b) ∈ A × B: γ(r) = [γ_1(r), …, γ_K(r)] is the similarity vector, where each γ_i is some match score function for the record pair. M is the match set and U the non-match set.
  29. 29. Record Linkage Model Probabilistic linkage is based on the likelihood ratio R(r) = m(γ(r)) / u(γ(r)), where m(γ) = P(γ | r ∈ M) and u(γ) = P(γ | r ∈ U). Linkage Rule: L(t_l, t_u) with upper and lower thresholds on R(r): R(r) > t_u → Match, t_l ≤ R(r) ≤ t_u → Uncertain, R(r) < t_l → Non-Match.
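The thresholded decision rule can be sketched directly; here m_prob and u_prob are hypothetical callables standing in for the estimated m(γ(r)) and u(γ(r)):

```python
def linkage_decision(r, m_prob, u_prob, t_lower, t_upper):
    """Fellegi-Sunter style decision on the likelihood ratio
    R(r) = m(γ(r)) / u(γ(r)): match above the upper threshold,
    non-match below the lower one, otherwise uncertain (clerical review)."""
    R = m_prob(r) / u_prob(r)
    if R > t_upper:
        return 'match'
    if R < t_lower:
        return 'non-match'
    return 'uncertain'
```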
  30. 30. Linkage Rule Error - Type I Error: a non-match is called a match. - Type II Error: a match is called a non-match.
  31. 31. Optimizing a Linkage Rule L*(t*_l, t*_u) is optimal in Γ (similarity vector space) with error bounds μ and λ if: - L* bounds the type I and II errors at μ and λ - L* has the least conditional probability of not making a decision - e.g. minimizes the uncertain range in R(r).
  32. 32. L* Discovery Given N records in Γ (e.g. N similarity vectors): sort the similarity vectors in decreasing order of R(r) = m(γ) / u(γ), then select n and n′ such that calling γ_1, …, γ_n matches keeps the type I error within μ and calling γ_n′, …, γ_N non-matches keeps the type II error within λ; the vectors γ_n+1, …, γ_n′-1 remain uncertain.
  33. 33. Practical Application of FS Γ is high dimensional: the m(γ) and u(γ) computations are inefficient. Typically a naive Bayes assumption is made about the conditional independence of the features in γ given a match or a non-match. Computing P(γ | r ∈ M) requires knowledge of matches: - Supervised machine learning with a training set - Expectation Maximization (EM) to train parameters
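Under the naive Bayes assumption, m(γ) and u(γ) factor into per-attribute probabilities, so the log likelihood ratio becomes a sum of per-field agreement weights. A sketch with illustrative (not estimated) parameters:

```python
import math

def log_ratio(gamma, m_params, u_params):
    """Log likelihood ratio log(m(γ)/u(γ)) under conditional independence.
    gamma is a boolean agreement vector; m_params[i] = P(γ_i = 1 | M) and
    u_params[i] = P(γ_i = 1 | U) are illustrative per-field parameters
    that EM or a labeled training set would normally estimate."""
    total = 0.0
    for g, m, u in zip(gamma, m_params, u_params):
        if g:
            total += math.log(m / u)          # agreement weight
        else:
            total += math.log((1 - m) / (1 - u))  # disagreement weight
    return total
```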
  34. 34. Machine Learning Parameters Supervised Methods - Decision Trees - Cochinwala, Munir, et al. "Efficient data reconciliation." Information Sciences 137.1 (2001): 1-15. - Support Vector Machines - Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003. - Christen, Peter. "Automatic record linkage using seeded nearest neighbour and support vector machine classification." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008. - Ensembles of Classifiers - Chen, Zhaoqi, Dmitri V. Kalashnikov, and Sharad Mehrotra. "Exploiting context analysis for combining multiple entity resolution systems." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009. - Conditional Random Fields - Gupta, Rahul, and Sunita Sarawagi. "Answering table augmentation queries from unstructured lists on the web." Proceedings of the VLDB Endowment 2.1 (2009): 289-300.
  35. 35. Machine Learning Parameters Unsupervised Methods - Expectation Maximization - Winkler, William E. "Overview of record linkage and current research directions." Bureau of the Census. 2006. - Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Springer Science & Business Media, 2007. - Hierarchical Clustering - Ravikumar, Pradeep, and William W. Cohen. "A hierarchical graphical model for record linkage."Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004. Active Learning Methods - Committee of Classifiers - Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning."Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002. - Tejada, Sheila, Craig A. Knoblock, and Steven Minton. "Learning object identification rules for information integration." Information Systems 26.8 (2001): 607-633.
  36. 36. Implementing Papers Machine Learning & Fellegi Sunter is the state of the art, and luckily, all of these models are in Scikit-Learn. Considerations: - Building training sets is hard: - Most records are easy non-matches - Record pairs can be ambiguous - Class imbalance: more negatives than positives
  37. 37. Clustering & Blocking
  38. 38. To obtain a supervised training set, start by using clustering and then add active learning techniques to propose items to knowledge engineers for labeling.
  39. 39. Advantages to Clusters - Resolution decisions are not made simply on pairwise comparisons, but search a larger space. - Can use a variety of algorithms such that: - Number of clusters is not known in advance - There are numerous small, singleton clusters - Input is a pairwise similarity graph
  40. 40. Requirement: Blocking - The naive approach is |R|² comparisons. - Consider 100,000 products from each of 10 online stores: 1,000,000 records, so 1,000,000,000,000 comparisons. - At 1 μs per comparison ≈ 11.6 days - Most are not going to be matches - Can we block on product category?
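Blocking itself is simple: group records by a blocking key and only generate candidate pairs within each block. A sketch, where the key function (e.g. product category) is supplied by the caller:

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield candidate record pairs that share a blocking key,
    avoiding the naive |R|^2 comparison over the whole dataset."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)  # pairwise only within a block
```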
  41. 41. Canopy Clustering - Often used as a pre-clustering optimization for approaches that must do pairwise comparisons, e.g. K-Means or Hierarchical Clustering - Can be run in parallel, and is often used in Big Data systems (implementations exist in MapReduce on Hadoop) - Use a distance metric on the similarity vectors for computation.
  42. 42. Canopy Clustering The algorithm begins with two thresholds, T1 and T2, the loose and tight distances respectively, where T1 > T2. 1. Remove a point from the set and start a new “canopy”. 2. For each point in the set, assign it to the new canopy if the distance is less than the loose distance T1. 3. If the distance is also less than T2, remove it from the original set completely. 4. Repeat until there are no more data points to cluster.
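The steps above can be sketched directly; the caller supplies the distance function (per the previous slide, a distance metric over similarity vectors):

```python
import random

def canopy(points, distance, t1, t2):
    """Canopy clustering sketch: t1 is the loose threshold and t2 the
    tight one (t1 > t2). Points within t1 of a randomly chosen center
    join its canopy; points within t2 are removed from further
    consideration, so a point can appear in more than one canopy."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)   # loose: joins this canopy
            if d >= t2:
                survivors.append(p) # not tight: stays for later canopies
        remaining = survivors
        canopies.append(members)
    return canopies
```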
  43. 43. Canopy Clustering By setting the threshold values relatively permissively, canopies will capture more data. In practice, most canopies will contain only a single point and can be ignored. Pairwise comparisons are made between the similarity vectors inside each canopy.
  44. 44. Final Notes
  45. 45. Data Preparation Good Data Preparation can go a long way in getting good results, and is most of the work. - Data Normalization - Schema Normalization - Imputation
  46. 46. Data Normalization - convert to all lower case, remove whitespace - run spell checker to remove known typographical errors - expand abbreviations, replace nicknames - perform lookups in lexicons - tokenize, stem, or lemmatize words
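A few of these normalization steps can be sketched with the standard library alone; spell checking, lexicon lookups, and stemming would lean on tools like NLTK, and the nickname table here is a made-up example:

```python
import re

def normalize(text, nicknames=None):
    """Text normalization sketch: lowercase, collapse whitespace, and
    replace nicknames from a lookup table (an illustrative mapping,
    e.g. {'joe': 'joseph'}, supplied by the caller)."""
    nicknames = nicknames or {}
    text = re.sub(r'\s+', ' ', text.strip().lower())
    return ' '.join(nicknames.get(tok, tok) for tok in text.split())
```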
  47. 47. Schema Normalization - match attribute names (title → name) - compound attributes (full name → first, last) - nested attributes, particularly boolean attributes - deal with set and list valued attributes - segment records from raw text
  48. 48. Imputation - How do you deal with missing values? - Set all to nan or None, remove empty string. - How do you compare missing values? Omit from similarity vector? - Fill in missing values with aggregate (mean) or with some default value.
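The mean-fill strategy mentioned above is a few lines; this sketch mutates the records in place and assumes the field is numeric:

```python
def impute_mean(records, field):
    """Fill missing numeric values for a field with the mean of the
    observed values (one of the imputation strategies noted above)."""
    observed = [r[field] for r in records if r.get(field) is not None]
    mean = sum(observed) / len(observed) if observed else None
    for r in records:
        if r.get(field) is None:
            r[field] = mean
    return records
```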
  49. 49. Canonicalization Merge information from duplicates to a representative entity that contains maximal information - consider downstream resolution. (Name, Email, Phone, Address): Joe Halifax, joe.halifax@gmail.com, null, New York, NY + Joseph Halifax Jr., null, (212) 123-4444, 130 5th Ave Apt 12, New York, NY → Joseph Halifax, joe.halifax@gmail.com, (212) 123-4444, 130 5th Ave Apt 12, New York, NY
  50. 50. Evaluation - # of predicted matching pairs, cluster level metrics - Precision/Recall → F1 score - Confusion matrix: Actual Match predicted as Match is a True Match; Actual Match predicted as Miss is a False Miss; Actual Miss predicted as Match is a False Match; Actual Miss predicted as Miss is a True Miss. Row totals are |A| and |B|; column totals are |P(A)|, |P(B)|, and the grand total.
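Pairwise precision, recall, and F1 over predicted matching pairs can be sketched as set operations; here both arguments are sets of frozenset pairs (an encoding chosen for this example so pair order does not matter):

```python
def evaluate(true_matches, predicted_matches):
    """Pairwise evaluation: precision, recall, and F1 computed over
    sets of matching record pairs (each pair a frozenset of two ids)."""
    tp = len(true_matches & predicted_matches)  # true matches found
    precision = tp / len(predicted_matches) if predicted_matches else 0.0
    recall = tp / len(true_matches) if true_matches else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```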
  51. 51. Conclusion
