Will Lyon- Entity Resolution

Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public data sets. Many real-world datasets do not contain unique IDs; instead, we often use a combination of fields to identify unique entities across records by linking and grouping. This talk will show how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit or cosine distance) for deduplicating data in a graph database. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database.

  1. 1. Entity Resolution In Graph Data: An Active Learning Approach. William Lyon, @lyonwj, lyonwj.com, Sept 2017
  2. 2. William Lyon Developer Relations Engineer @neo4j will@neo4j.com @lyonwj lyonwj.com
  3. 3. https://neo4j.com/graph-database-data-journalism-accelerator-program/
  4. 4. Entity Resolution Deduplication ?
  5. 5. Example: Campaign Finance
  6. 6. contributions committees candidates http://www.fec.gov/finance/disclosure/ftpdet.shtml
  7. 7. Datamodel http://bit.ly/neo4jire
  8. 8. candidates https://github.com/johnymontana/neo4j-datasets/tree/master/us-fec-elections-2016
  9. 9. committees https://github.com/johnymontana/neo4j-datasets/tree/master/us-fec-elections-2016
  10. 10. contributions Unique id??? https://github.com/johnymontana/neo4j-datasets/tree/master/us-fec-elections-2016
  11. 11. contributions Synthetic id on concat(name, zip) https://github.com/johnymontana/neo4j-datasets/tree/master/us-fec-elections-2016
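The synthetic id from slide 11 can be sketched in Python as follows. This is a minimal illustration, not the code used in the talk: the function name `synthetic_id`, the whitespace and case normalization, the 5-digit ZIP truncation, and the sample values are all assumptions added here.

```python
def synthetic_id(name: str, zip_code: str) -> str:
    """Build a surrogate key by concatenating a normalized name and ZIP code."""
    # Assumption: collapse repeated whitespace and ignore case.
    clean_name = " ".join(name.split()).upper()
    # Assumption: keep only the 5-digit ZIP so "94401-1234" matches "94401".
    clean_zip = zip_code.strip()[:5]
    return f"{clean_name}|{clean_zip}"


# Hypothetical contributor rows, not taken from the FEC data.
print(synthetic_id("Bob  Loblaw ", "94401-1234"))
print(synthetic_id("BOB LOBLAW", "94401"))
```

A key like this only merges records whose name and ZIP agree after normalization; "Bob Loblaw" and "Robert Loblaw" would still get different ids, which is the gap the similarity-based approach in the rest of the deck addresses.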
  12. 12. Sample data
  13. 13. Entity Resolution In Graph Data ?
  14. 14. Entity Resolution In Graph Data 1) Inferred relationships 2) Joining datasets 3) Aggregating traversals
  15. 15. 1) Inferred relationships https://offshoreleaks.icij.org/pages/database
  16. 16. 1) Inferred relationships https://offshoreleaks.icij.org/pages/database
  17. 17. 1) Inferred relationships LIVES_WITH https://offshoreleaks.icij.org/pages/database
  18. 18. 2) Joining datasets
  19. 19. 2) Joining datasets http://www.lyonwj.com/2017/01/30/trumpworld-us-contracting-data-neo4j/ +
  20. 20. 2) Joining datasets http://www.lyonwj.com/2017/01/30/trumpworld-us-contracting-data-neo4j/ +
  21. 21. 2) Joining datasets http://www.cnn.com/2017/05/19/politics/private-prisons/index.html https://www.nytimes.com/2017/02/24/opinion/under-mr-trump http://www.cnn.com/2017/08/18/politics/private-prison-department-of-justice/index.html
  22. 22. 3) Aggregating traversals
  23. 23. 3) Aggregating traversals
  24. 24. Entity Resolution An Active Learning Approach
  25. 25. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com.
  26. 26. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Probably, but how to quantify?
  27. 27. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Probably, but how to quantify?
  28. 28. Are these the same entity? Name Bob Loblaw Name Robert Loblaw String distance (similarity) metric
  29. 29. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance
  30. 30. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance Edits required to convert “Bob Loblaw” to “Robert Loblaw”
  31. 31. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance 4
  32. 32. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance
  33. 33. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance
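The edit distance shown on the slides above can be reproduced with a standard Levenshtein implementation. The sketch below is not from the talk; the function name `edit_distance` is an assumption, and it counts single-character insertions, deletions, and substitutions, case-sensitively.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Bob Loblaw", "Robert Loblaw"))
print(edit_distance("111 E 5th Ave. San Mateo, CA", "111 5th Ave"))
```

With the sample records from the deck, this gives 4 for the names and 17 for the addresses, matching the per-field distances used on the later slides.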
  34. 34. Are these the same entity? Name Bob Loblaw Name Robert Loblaw TF/IDF • Term based • set of words • Order doesn’t matter • Words weighted based on probability
  35. 35. Are these the same entity? Name Bob Loblaw Name Robert Loblaw TF/IDF • Pro: • takes advantage of frequency • Order doesn’t matter • William Cohen -vs- Cohen, William • Con: • Spelling errors / abbreviations • Order doesn’t matter • City National Bank -vs- National City Bank
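The TF/IDF pros and cons above can be illustrated with a small bag-of-words cosine similarity. This is a sketch under stated assumptions, not the talk's code: the helper names, the regex tokenizer, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are choices made here for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; punctuation and word order are discarded."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    n = len(docs)
    bags = [Counter(tokens(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bag in bags for term in bag)
    # Assumption: smoothed IDF so that common terms keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


names = ["William Cohen", "Cohen, William", "City National Bank",
         "National City Bank", "Willam Cohen"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # reordered person name
print(cosine(vecs[2], vecs[3]))  # two different banks, same words
print(cosine(vecs[0], vecs[4]))  # misspelled first name
```

Because word order is ignored, the reordered person name and the two differently named banks both score as identical, while the misspelled "Willam" only matches on the shared surname.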
  36. 36. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Edit distance probabilistic extensions • Gap distance • Edit distance + HMM
  37. 37. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Soundex • Phonetic indexing scheme • Genealogy • Soundex code • Hash
  38. 38. Are these the same entity? Name Bob Loblaw Name Robert Loblaw Soundex
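As a rough illustration of the Soundex code mentioned above, here is a sketch of the common American Soundex rules (first letter plus three digits, adjacent duplicate codes collapsed, H and W not separating duplicates). The function name `soundex` is an assumption, and this is not the code shown on the slide.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        # H and W are transparent: they do not reset the previous code.
        if ch not in "HW":
            prev = code
    return ("".join(out) + "000")[:4]


for name in ("Robert", "Rupert", "Bob"):
    print(name, soundex(name))
```

Soundex acts as a hash over pronunciation, so "Robert" and "Rupert" collide on the same code, but a nickname such as "Bob" still maps to a different code than "Robert".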
  39. 39. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4.
  40. 40. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4; Address = 17; Phone = Null; Email = 0.
  41. 41. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4; Address = 17; Phone = Null; Email = 0. Weight: Name = ???; Address = ???; Phone = ???; Email = ???.
  42. 42. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4; Address = 17; Phone = Null; Email = 0. Weight: Name = 0.2; Address = 0.03; Phone = 0.02; Email = 0.75.
  43. 43. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4; Address = 17; Phone = Null; Email = 0. Weight: Name = 0.2; Address = 0.03; Phone = 0.02; Email = 0.75. Weighted distance (Null counted as 0): (4*0.2)+(17*0.03)+(0*0.02)+(0*0.75) = 1.31
  44. 44. Are these the same entity? Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com. Dist: Name = 4; Address = 17; Phone = Null; Email = 0. Weight: Name = 0.2; Address = 0.03; Phone = 0.02; Email = 0.75. Where did the weights come from?
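The weighted distance from slide 43 can be written out as a short calculation. The per-field distances and weights below are the ones from the slides; the function name `weighted_distance` and the choice to count a missing (Null) distance as 0 are assumptions that mirror the arithmetic shown there.

```python
# Per-field distances and weights as given on the slides.
distances = {"name": 4, "address": 17, "phone": None, "email": 0}
weights = {"name": 0.2, "address": 0.03, "phone": 0.02, "email": 0.75}


def weighted_distance(distances, weights):
    """Sum of weight * distance per field; a missing distance counts as 0."""
    total = 0.0
    for field, weight in weights.items():
        d = distances.get(field)
        total += weight * (0 if d is None else d)
    return total


score = weighted_distance(distances, weights)
print(round(score, 2))
```

The open question on the slide remains: these weights are hand-picked, which is what the active learning step is meant to replace.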
  45. 45. Active Learning
  46. 46. Active Learning • Goal: learn weights for distance per field • Example pairs w/ labels • Minimize human labeling time • Present borderline pairs to human for labeling • Relearn weights • Iterate. Record 1: Name = Bob Loblaw; Address = 111 E 5th Ave. San Mateo, CA; Phone = (blank); Email = bob@neo4j.com. Record 2: Name = Robert Loblaw; Address = 111 5th Ave; Phone = 855-636-4532; Email = bob@neo4j.com.
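The "present borderline pairs to human for labeling" step is commonly implemented as uncertainty sampling: pick the unlabeled pairs whose predicted probability is closest to 0.5. The sketch below covers only that selection step, not the full label/relearn loop. It borrows the coefficients reported on slide 50 as the current model; the function names and the candidate pairs are hypothetical.

```python
import math

# Coefficients reported on slide 50 (intercept and distance slope).
INTERCEPT, SLOPE = -4.0777, 1.5046


def p_non_duplicate(distance: float) -> float:
    """Logistic model: probability that a pair is NOT a duplicate."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE * distance)))


def most_uncertain(pairs, k=1):
    """Return the k pairs whose predicted probability is closest to 0.5."""
    return sorted(pairs, key=lambda pair: abs(p_non_duplicate(pair[1]) - 0.5))[:k]


# Hypothetical unlabeled candidate pairs: (pair id, weighted distance).
unlabeled = [("pair-a", 0.5), ("pair-b", 2.75), ("pair-c", 5.0)]
print(most_uncertain(unlabeled))
```

Pairs far from the decision boundary tell the model little, so asking a human about the borderline pair first is what keeps labeling time down.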
  47. 47. Logistic regression • Categorical dependent variable • binary (0,1) • Classification • Fit logistic function https://en.wikipedia.org/wiki/Logistic_function
  48. 48. Logistic regression - example Distance 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50 Nonduplicate 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
  49. 49. Logistic regression - example [Plot: probability of non-duplicate (y-axis) vs. edit distance (x-axis)] Distance 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50 Nonduplicate 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
  50. 50. Logistic regression - example [Plot: probability of non-duplicate (y-axis) vs. edit distance (x-axis)] Distance 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50 Nonduplicate 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 Fitted coefficients (estimate, std err): Intercept -4.0777, 1.7610; Distance 1.5046, 0.6287
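To show where coefficients like those on slide 50 come from, the sketch below fits a one-variable logistic regression to the 20 data points from the slides using Newton's method. The slides do not say which solver was used, so the fitting routine, its name `fit_logistic`, and the iteration count are assumptions; a maximum-likelihood fit should land close to the reported values.

```python
import math

# Data from the slides: distance per pair and its non-duplicate label.
distance = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
            2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
nonduplicate = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logistic(xs, ys, iterations=25):
    """Maximum-likelihood fit of p = sigmoid(b0 + b1 * x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g0 += y - p
            g1 += (y - p) * x
            # Entries of the 2x2 matrix X^T W X (negative Hessian).
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


intercept, slope = fit_logistic(distance, nonduplicate)
print(round(intercept, 4), round(slope, 4))
```

The fitted curve crosses probability 0.5 where the distance equals `-intercept / slope`, which is the natural threshold between "duplicate" and "non-duplicate" for this single-feature example.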
  51. 51. Blocking - making smart comparisons • Problem: All-pairs comparison is expensive • (1000*999) / 2 = 499,500 pairs • Duplicates are rare • almost all pairs are not duplicates • Blocking functions limit record pairs to be compared • “Somewhat similar” records
  52. 52. Blocking - making smart comparisons • Predicate blocks • whole field • token field • common integer • same three char start • common n gram • Index block • inverted index • find similar records close to each other in the index
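A predicate block such as "same three char start" can be sketched as grouping records by a key and only comparing records within the same group. The code below is an illustration under assumptions, not the Dedupe library's implementation: the function names and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of comparisons without blocking: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def block_by_prefix(records, length=3):
    """Group record ids by the first `length` characters of the name."""
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[name[:length].lower()].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of record ids that share a block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hypothetical records: id -> name.
records = {1: "Bob Loblaw", 2: "Bobby Loblaw", 3: "Robert Loblaw", 4: "Roberta Lopez"}
pairs = list(candidate_pairs(block_by_prefix(records)))
print(all_pairs_count(1000))
print(pairs)
```

Here blocking cuts six possible pairs down to two, but it also shows the risk of a single predicate: "Bob Loblaw" and "Robert Loblaw" never share a block, so in practice several blocking predicates are combined.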
  53. 53. Code samples
  54. 54. Dedupe by Datamade • Entity resolution Python library • Active learning • Logistic regression • Blocking • Generate CSV or query database directly • Cypher, python driver for Neo4j https://github.com/dedupeio
  55. 55. Training
  56. 56. Active Learning
  57. 57. Writing results
  58. 58. Resources
  59. 59. Neo4j Sandbox neo4jsandbox.com
  60. 60. GraphConnect GraphConnect.com Promo code: INTUIT50
  61. 61. GraphConnect http://graphconnect.com/ Promo code: INTUIT50
  62. 62. (you)-[:HAVE]->(?) (?)<-[:ANSWERS]-(will)