
Improving Graph Based Entity Resolution with Data Mining and NLP


“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the heart of all developers, but integrating ‘dirty’, unstructured, denormalized and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvement in the accuracy of the resulting matches.

Published in: Software

  1. 1. Improving Graph Based Entity Resolution Using Data Mining and NLP
  2. 2. Hello, I’m David Bechberger Architect and Developer ● Distributed systems ● High performance low latency big data platforms ● Graph Databases ● Teach and Mentor fellow developers www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger
  3. 3. Entity Resolution
  4. 4. What is Entity Resolution The process of linking digital entities in data to real world entities.
  5. 5. I am known by many names but you may call me: ● Data referencing ● Record Linkage ● Canonicalization ● Coreference resolution ● Merge/purge ● Entity Clustering ● ….
  6. 6. Why is it Hard? ● Structured versus Unstructured ● Name Ambiguity ● Typos/Transposition/Data Errors ● Missing/Incomplete Data ● Changing Data ● Abbreviations
  7. 7. Two types of ER problems Ones with canonical data Ones without canonical data
  8. 8. Typical Entity Resolution Steps ● Deduplication ● Canonicalization/Standardization ● Blocking/Clustering ● Linking Records
  9. 9. Wait, I thought we were talking about graphs?
  10. 10. Example Graph Entity Resolution Problems ● Master Data Management ● Linking Customers ● Recommendation Engines ● Intrusion Detection ● Fraud analysis
  11. 11. What are we talking about today?
  12. 12. How can Data Mining/NLP help? ● String Similarity ● Named Entity Recognition ● Shingling ● Active/Machine Learning
  13. 13. How can graphs help? ● Aggregating Traversals ● Pattern Matching ● Inferring Relationships ● Path ● Clustering
  14. 14. Example - Product Catalogs
  15. 15. Problem - Matching Product Data ● Product catalog data from Amazon and Google* ● Already deduplicated ● ~1300 Amazon Products, ~3200 Google Products ● Contains a list of perfect matches for testing against *Datasets are from the Database Group Leipzig and are available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
  16. 16. Goal Match Amazon data with Google data to build out the basis for a master data management solution
  17. 17. What are we starting with?
      Title | Manufacturer | Description
      clickart 950 000 - premier image pack (dvd-rom) | broderbund | (blank)
      ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
      learning quickbooks 2007 | intuit | learning quickbooks 2007
      eu063av aba microsoft windows xp professional | hp | eu063av aba : usually ships in 24 hours...
      [Diagram: Product node (ID, Title, Description, Origin) with a built_by edge to a Manufacturer node (Name)]
  18. 18. How are we going to get there? 1. Bipartite and Pattern Matching 2. Iteratively add attributes to data 3. Try to match on weighted attributes
  19. 19. Bipartite/Pattern Matching using Gremlin
  20. 20. Bipartite Graph Matching ● Matched on exact titles ● Found 216 matches [Diagram: product nodes Quick Book and Turbo Tax]
  21. 21. Bipartite Graph Matching g.V().hasLabel('product').group().by(values('title').fold()).unfold().filter(select(values).count(local).is(gt(1)))
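
The exact-title step above groups every product by its title and keeps only the groups with more than one member. As a rough plain-Python analogue (not the Gremlin traversal itself), the sketch below assumes each product is a dict with hypothetical `id` and `title` keys.

```python
from collections import defaultdict


def exact_title_matches(products):
    """Group product ids by exact title and keep titles shared by 2+ products.

    `products` is assumed to be an iterable of dicts with 'id' and 'title'
    keys; these field names are illustrative, not taken from the talk.
    """
    ids_by_title = defaultdict(list)
    for product in products:
        ids_by_title[product["title"]].append(product["id"])
    # Mirrors the filter(... count(local).is(gt(1))) step: drop singletons.
    return {title: ids for title, ids in ids_by_title.items() if len(ids) > 1}
```

Like the traversal on the slide, this groups by title only; restricting each group to one Amazon and one Google record would be an extra step.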
  22. 22. Graph Pattern Matching ● Matched on manufacturer + fuzzy match on title ● Found 354 matches [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; built_by edges]
  23. 23. Graph Pattern Matching g.V().hasLabel('product').or( group().by(values('manufacturer').fold()).unfold().filter(select(values).count(local).is(gt(1))), match( __.as('a').has('origin', 'amazon').as('amazon'), __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'), __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title')).values('title'), 2)) )
  24. 24. Find Canonical Manufacturers
  25. 25. Find Manufacturers in Amazon data ● Fuzzy match to find unique manufacturers ● Create and link nodes to unique manufacturers ● Found 227 manufacturers [Diagram: product node Quick Book; manufacturer nodes Intuit and Intuit Corp, labeled Canonical and Original; built_by edges]
  26. 26. Find Manufacturers in Google data ● ~7% had manufacturers (232/3229) ● 224 products matched existing manufacturers ● Found 8 more unique manufacturers
  27. 27. Validate Canonical Manufacturers ● Review and validate canonical data ● Add edges between data that represent the same entity [Diagram: manufacturer nodes Sony, Sony Corp, Intuit Corp, Intuit; is_same_as edges]
  28. 28. Build out the Canonical Manufacturer graph ● Found 235 unique manufacturers ● 14 aliases ● Canonical Manufacturers added to graph with aliases [Diagram: manufacturer nodes Intuit Corp, Intuit, Microsoft, Sony; is_same_as edge]
  29. 29. What’s our graph look like now? [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit, Intuit Corp, Microsoft, Sony; built_by and is_same_as edges]
  30. 30. Manufacturer Pattern Matching ● Added Manufacturer Traversal into Pattern Match ● Found 534 matches [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit and Intuit Corp, labeled Canonical and Original]
  31. 31. Graph Pattern Matching g.V().hasLabel('product').or( group().by(values('manufacturer').fold()).unfold().filter(select(values).count(local).is(gt(1))), match( __.as('a').has('origin', 'amazon').as('amazon'), __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'), __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title')).values('title'), 2)), V().repeat(out().hasLabel(within('built_by', 'is_same_as'))).limit(3) )
  32. 32. ~41% ● Found 534 of 1300
  33. 33. Use NLP/Data Mining to add attributes
  34. 34. A quick word on Similarity Measurements ● Many different algorithms, each solves a different problem ● Know your data ● Research the options and choose the right one for your data
  35. 35. Most Google data is missing a manufacturer. Or is it? Example: eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english
  36. 36. Named Entity Recognition ● The process of classifying entities in strings into known categories ● Examples: "microsoft xbox 360: forza motorsport 2"; "sony playstation 2: karaoke revolution: american idol bundle"; "ibm(r) viavoice(r) advanced edition 10"
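
For the manufacturer case, a very small stand-in for named entity recognition is a dictionary (gazetteer) lookup: scan the title tokens for names already known from the canonical manufacturer list. The sketch below is that simplification, not a trained NER model such as the ones Apache OpenNLP provides; the `KNOWN_MANUFACTURERS` set and the function name are hypothetical.

```python
import re

# Hypothetical gazetteer; in practice this would come from the canonical
# manufacturer nodes already in the graph.
KNOWN_MANUFACTURERS = {"microsoft", "sony", "ibm", "intuit"}


def find_manufacturers(title, known=KNOWN_MANUFACTURERS):
    """Return known manufacturer names found as whole tokens in a title.

    Tokens are lowercase runs of letters and digits, so 'ibm(r)' yields
    the tokens 'ibm' and 'r'. Order of first appearance is preserved.
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    found = []
    for token in tokens:
        if token in known and token not in found:
            found.append(token)
    return found
```

A lookup like this only finds names it already knows, which is why it pairs naturally with a curated canonical list; a statistical NER model would be needed to discover unseen manufacturers.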
  37. 37. Damerau-Levenshtein Distance ● Measures the edit distance between two strings ● Handles insertions, deletions, transpositions and substitutions [Diagram: Sony, Snoy, Snyo with distances 1, 2, 2]
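
To make the edit operations concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which counts insertions, deletions, substitutions and adjacent transpositions. It is an illustrative implementation, not the one from the Java String Similarity library; the function name is an assumption.

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "on" <-> "no".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this sketch, `osa_distance("sony", "snoy")` is 1 (one transposition) and `osa_distance("sony", "snyo")` is 2.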
  38. 38. Add distance attribute [Diagram: product node Quick Book with a built_by edge; manufacturer nodes Intuit and Intuit Corp, one labeled Canonical; edge attributes distance:2 and distance:3]
  39. 39. Find similarity between titles
      Amazon Title | Google Title
      ms visual studio 2011 plus | video studio 11 plus
      Spiderman 3 ps2 | activision 81935 spiderman 3 ps2
      kids power fun for girls | Topic entertainment kids power fun for girls
  40. 40. Jaccard Index ● Similarity measure between finite sets (A, B) ● Works on n-grams ● Calculated as intersection over union: J(A,B) = |A∩B| / |A∪B|
      N=1 (unigram): "This is a sentence" → this, is, a, sentence
      N=2 (bigram): "This is a sentence" → this is, is a, a sentence
      N=3 (trigram): "This is a sentence" → this is a, is a sentence
  41. 41. Jaccard Index ● A = Dragon Natural Speaking 9.0 ● B = Dragon Natural 9.0 Professional ● |A ∪ B| = 5 ● |A ∩ B| = 3 ● Jaccard Index = 3/5 = 0.60 [Diagram: sets A and B over the tokens Dragon, Natural, Speaking, 9.0, Professional]
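
The Jaccard calculation above can be sketched in a few lines. This version assumes word-level n-grams built from lowercased, whitespace-split tokens; the function names are illustrative rather than taken from any of the toolkits used in the talk.

```python
def ngrams(text, n=1):
    """Return the set of word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_index(a, b, n=1):
    """Jaccard index |A ∩ B| / |A ∪ B| over the word n-gram sets of two strings."""
    set_a, set_b = ngrams(a, n), ngrams(b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings empty: define the similarity as 0
    return len(set_a & set_b) / len(union)
```

For the two Dragon titles, the unigram sets share three tokens out of five distinct ones, so `jaccard_index` returns 0.6, matching the slide.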
  42. 42. Add jaccard attribute [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; built_by edges; attribute jaccard:0.6]
  43. 43. Find similarity between descriptions ● TF-IDF finds the relative importance of words in a document ● Cosine similarity compares two vectors and gives the similarity between them
  44. 44. TF-IDF ● TF = (# of times a word appears) / (# of words in a document) ● IDF = log((# of documents) / (# of documents with term)) ● TF-IDF = TF × IDF
      Word | TF-IDF Score
      unique | 4.43
      bag | 4.34
      original | 2.945
      professional | 1.336
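
As a rough illustration of the TF and IDF definitions above, the sketch below scores every word in a small list of documents. It assumes whitespace tokenization, lowercasing, a natural logarithm and no smoothing; library implementations often differ in those details, so the absolute scores will not match the slide's table.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf score} dict per document.

    TF  = occurrences of the word / number of words in the document
    IDF = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

A word that appears in every document gets an IDF of log(1) = 0, which is how common terms drop out of the comparison.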
  45. 45. Cosine similarity
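
Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(θ) = (A · B) / (|A| |B|). A minimal sketch follows, assuming each description is represented as a sparse {word: weight} dict (for example, TF-IDF scores); the representation and function name are assumptions for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, a long and a short description that use the same words in the same proportions still score close to 1.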
  46. 46. Add cosine_similarity attribute [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; built_by edges; attribute cosine_similarity:0.75]
  47. 47. Putting it all together
  48. 48. What does our graph look like now? [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; is_same_as and built_by edges; attributes distance:2, distance:2, distance:3, distance:3, jaccard:0.6, cosine_similarity:0.75]
  49. 49. Aggregating Traversal ● Aggregate all the values into a weighted sum* ● The highest sum was the most likely match ● Value = cosine_similarity + jaccard + (manufacturer: simplest traversal path where distance is <= 2 and path length is <= 3) *For this talk I used evenly weighted values; in practice the weights need to be calculated
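
The weighted sum above might look something like the following sketch. How much the manufacturer path contributes is not stated on the slide, so this assumes it adds 1.0 when a path exists with distance <= 2 and length <= 3, and 0.0 otherwise; the function names, parameter names and candidate dict keys are hypothetical.

```python
def match_score(cosine_similarity, jaccard, manufacturer_distance=None,
                path_length=None, weights=(1.0, 1.0, 1.0)):
    """Evenly weighted (by default) sum of the three match signals.

    ASSUMPTION: the manufacturer term is 1.0 when a manufacturer path was
    found with edit distance <= 2 and path length <= 3, else 0.0.
    Pass None for both when no manufacturer path exists.
    """
    has_manufacturer_path = (
        manufacturer_distance is not None
        and path_length is not None
        and manufacturer_distance <= 2
        and path_length <= 3
    )
    w_cos, w_jac, w_man = weights
    return (w_cos * cosine_similarity
            + w_jac * jaccard
            + w_man * (1.0 if has_manufacturer_path else 0.0))


def best_match(candidates):
    """Pick the candidate dict with the highest score, or None if there are none."""
    return max(candidates, key=lambda c: c["score"], default=None)
```

Keeping the weights as a parameter leaves room for the tuning the footnote mentions instead of hard-coding equal weights.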
  50. 50. What does our traversal look like? ● Value = cosine_similarity + jaccard + (traversal paths <3) [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit]
  51. 51. So how did we do?
  52. 52. ~87% ● Found 1130 of 1300 ● ~1.2% error rate
  53. 53. Where do we go from here?
  54. 54. Clustering/Blocking ● N-squared comparisons are expensive ● Blocking and Clustering limit comparisons to only those likely to match
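
One simple way to picture blocking: give every record a cheap key and only compare records that share it. The sketch below uses the first word of the title as the key, which is purely an illustrative choice; the record fields and function names are hypothetical.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: the first lowercased word of the title."""
    tokens = record["title"].lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(amazon, google, key=blocking_key):
    """Yield (amazon_record, google_record) pairs that share a blocking key.

    Instead of comparing every Amazon record with every Google record,
    only records that land in the same block are paired.
    """
    blocks = defaultdict(list)
    for record in google:
        blocks[key(record)].append(record)
    for record in amazon:
        for candidate in blocks.get(key(record), []):
            yield record, candidate
```

A key this crude will miss matches whose titles start differently, so the choice of key trades recall against the number of comparisons.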
  55. 55. Improve NLP/Data Mining Techniques ● Tune algorithms ● Find accurate weights with Active Learning ● Locality Sensitive Hashing
  56. 56. Toolkits I used ● Apache Commons - https://commons.apache.org/ ● Java String Similarity - https://github.com/tdebatty/java-string-similarity ● Apache OpenNLP - https://opennlp.apache.org/ ● Apache TinkerPop - http://tinkerpop.apache.org/
  57. 57. Thanks, any questions? www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger
