Improving Graph Based Entity Resolution with Data Mining and NLP


“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the heart of all developers, but integrating ‘dirty’, unstructured, denormalized and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.


  1. 1. Improving Graph Based Entity Resolution Using Data Mining and NLP
  2. 2. Hello, I’m David Bechberger Architect and Developer ● Distributed systems ● High performance low latency big data platforms ● Graph Databases ● Teach and Mentor fellow developers www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger
  3. 3. Entity Resolution
  4. 4. What is Entity Resolution The process of linking digital entities in data to real world entities.
  5. 5. I am known by many names but you may call me: ● Data referencing ● Record Linkage ● Canonicalization ● Coreference resolution ● Merge/purge ● Entity Clustering ● ….
  6. 6. Why is it Hard? ● Structured versus Unstructured ● Name Ambiguity ● Typos/Transposition/Data Errors ● Missing/Incomplete Data ● Changing Data ● Abbreviations
  7. 7. Two types of ER problems Ones with canonical data Ones without canonical data
  8. 8. Typical Entity Resolution Steps ● Deduplication ● Canonicalization/Standardization ● Blocking/Clustering ● Linking Records
  9. 9. Wait, I thought we were talking about graphs?
  10. 10. Example Graph Entity Resolution Problems ● Master Data Management ● Linking Customers ● Recommendation Engines ● Intrusion Detection ● Fraud analysis
  11. 11. What are we talking about today?
  12. 12. How can Data Mining/NLP help? ● String Similarity ● Named Entity Recognition ● Shingling ● Active/Machine Learning
  13. 13. How can graphs help? ● Aggregating Traversals ● Pattern Matching ● Inferring Relationships ● Path ● Clustering
  14. 14. Example - Product Catalogs
  15. 15. Problem - Matching Product Data ● Product catalog data from Amazon and Google* ● Already deduplicated ● ~1300 Amazon Products, ~3200 Google Products ● Contains a list of perfect matches for testing against *Datasets are from the Database Leipzig Group and are available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
  16. 16. Goal Match Amazon data with Google data to build out the basis for a master data management solution
  17. 17. What are we starting with? Sample rows (Title | Manufacturer | Description): clickart 950 000 - premier image pack (dvd-rom) | broderbund | (blank); ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops; learning quickbooks 2007 | intuit | learning quickbooks 2007; eu063av aba microsoft windows xp professional | hp | eu063av aba : usually ships in 24 hours... Graph model: Product (ID, Title, Description, Origin) -built_by-> Manufacturer (Name)
  18. 18. How are we going to get there? 1. Bipartite and Pattern Matching 2. Iteratively add attributes to data 3. Try and match on weighted attributes
  19. 19. Bipartite/Pattern Matching using Gremlin
  20. 20. Bipartite Graph Matching ● Matched on exact titles ● Found 216 matches Quick Book Turbo Tax
  21. 21. Bipartite Graph Matching g.V().hasLabel("product").group().by(values('title').fold()).unfold().filter(select(values).count(local).is(gt(1)))
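
For readers without a graph database at hand, the exact-title grouping in the traversal above can be approximated in plain Python. This is only a sketch of the idea (group by title, keep groups with more than one member); the record layout, ids, and the `exact_title_matches` helper are assumptions made for illustration and are not taken from the talk.

```python
from collections import defaultdict


def exact_title_matches(products):
    """Group product ids by title and keep titles shared by more than one record."""
    by_title = defaultdict(list)
    for product in products:
        by_title[product["title"]].append(product["id"])
    # Mirrors the filter step: only groups with count > 1 survive.
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}


# Hypothetical sample records; ids and titles are made up for illustration.
products = [
    {"id": "a1", "origin": "amazon", "title": "learning quickbooks 2007"},
    {"id": "g1", "origin": "google", "title": "learning quickbooks 2007"},
    {"id": "g2", "origin": "google", "title": "turbo tax deluxe"},
]
matches = exact_title_matches(products)
```

Because both catalogs are already deduplicated, a title that appears more than once is expected to come from both sources, which is why the sketch does not check `origin` explicitly.
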
  22. 22. Graph Pattern Matching Quick Book Turbo Tax Intuit Corp Intuit built_by built_by ● Matched on manufacturer + fuzzy match on title ● Found 354 matches
  23. 23. Graph Pattern Matching g.V().hasLabel('product').or(group().by(values('manufacturer').fold()).unfold().filter(select(values).count(local).is(gt(1))), match(__.as('a').has('origin', 'amazon').as('amazon'), __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'), __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title')).values('title'), 2)))
  24. 24. Find Canonical Manufacturers
  25. 25. Find Manufacturers in Amazon data ● Fuzzy match to find unique manufacturers ● Create and link nodes to unique manufacturers ● Found 227 manufacturers Intuit Intuit Corp Quick Book Intuit Corp built_by built_by Canonical Original
  26. 26. Find Manufacturers in Google data ● ~7% had manufacturers (232/3229) ● 224 products matched existing manufacturers ● Found 8 more unique manufacturers
  27. 27. Validate Canonical Manufacturers ● Review and validate canonical data ● Add edges between data that represent the same entity Sony Sony Corp Intuit Corp Intuit is_same_as is_same_as
  28. 28. Build out the Canonical Manufacturer graph ● Found 235 unique manufacturers ● 14 aliases ● Canonical Manufacturers added to graph with aliases Intuit Corp Intuit is_same_as Microsoft Sony
  29. 29. What’s our graph look like now? Intuit Intuit Corp Intuit Corp Quick Book Intuit Turbo Tax Microsoft Sony is_same_as built_by built_by built_by built_by
  30. 30. Manufacturer Pattern Matching ● Added Manufacturer Traversal into Pattern Match ● Found 534 matches Intuit Intuit Corp Intuit Corp Quick Book Canonical Original Intuit Turbo Tax
  31. 31. Graph Pattern Matching g.V().hasLabel('product').or(group().by(values('manufacturer').fold()).unfold().filter(select(values).count(local).is(gt(1))), match(__.as('a').has('origin', 'amazon').as('amazon'), __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'), __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title')).values('title'), 2)), V().repeat(out().hasLabel(within('built_by', 'is_same_as'))).limit(3))
  32. 32. ~41% ● Found 534 of 1300
  33. 33. Use NLP/Data Mining to add attributes
  34. 34. A quick word on Similarity Measurements ● Many different algorithms, each solves a different problem ● Know your data ● Research the options and choose the right one for your data
  35. 35. Most Google Data Missing Manufacturer Or is it? Example: eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english
  36. 36. Named Entity Recognition Process of classifying entities in strings into known categories microsoft xbox 360: forza motorsport 2 sony playstation 2: karaoke revolution: american idol bundle ibm(r) viavoice(r) advanced edition 10
  37. 37. Damerau-Levenshtein Distance ● Measures the edit distance between two strings ● Handles insertions, deletions, transpositions and substitutions Sony Snoy Snyo 1 2 2
  38. 38. Add distance attribute Intuit Intuit Corp Intuit Quick Book built_by Canonical distance:2 distance:3
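
To make the distance attribute concrete, here is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and adjacent transpositions. The talk used a Java library for this measure, so the function below (`osa_distance`, a name chosen here) is an illustration of the technique rather than the implementation behind the slides, and the library's variant may differ in edge cases.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "on" <-> "no".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


one_swap = osa_distance("sony", "snoy")  # a single adjacent transposition
two_edits = osa_distance("sony", "snyo")
```

A distance computed this way could be stored on the edge between an original manufacturer name and its canonical form, as the slide above suggests.
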
  39. 39. Find similarity between titles Amazon Title Google Title ms visual studio 2011 plus video studio 11 plus Spiderman 3 ps2 activision 81935 spiderman 3 ps2 kids power fun for girls Topic entertainment kids power fun for girls
  40. 40. Jaccard Index ● Set similarity measure between finite sets (A, B) ● Works on n-grams ● Calculated as Intersection over Union: J(A,B) = |A ∩ B| / |A ∪ B| Example n-grams for "This is a sentence": N=1 (Unigram): this, is, a, sentence; N=2 (Bigram): this is, is a, a sentence; N=3 (Trigram): this is a, is a sentence
  41. 41. Jaccard Index example: A = Dragon Natural Speaking 9.0; B = Dragon Natural 9.0 Professional; |A ∪ B| = 5 (Dragon, Natural, Speaking, 9.0, Professional); |A ∩ B| = 3 (Dragon, Natural, 9.0); Jaccard Index = 3/5 = 0.60
  42. 42. Add jaccard attribute Quick Book Turbo Tax Intuit Corp Intuit built_by built_by jaccard:0.6
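
The Jaccard calculation on the preceding slides can be sketched in a few lines of Python. The sketch assumes whitespace tokenization and lowercasing, which the slides do not specify; `ngrams` and `jaccard` are illustrative helper names, not functions from the toolkits used in the talk.

```python
def ngrams(text: str, n: int = 1) -> set:
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 1) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    set_a, set_b = ngrams(a, n), ngrams(b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both inputs empty: define similarity as 0
    return len(set_a & set_b) / len(union)


# The product titles from the Jaccard example slide.
score = jaccard("Dragon Natural Speaking 9.0", "Dragon Natural 9.0 Professional")
bigrams = ngrams("This is a sentence", 2)
```

With unigrams the two titles share 3 of 5 distinct tokens, so `score` matches the 0.60 shown on the slide.
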
  43. 43. Find similarity between descriptions ● TF-IDF finds the relative importance of words in a document ● Cosine similarity compares two vectors and gives the similarity between them
  44. 44. TF-IDF: TF = (# of times a word appears) / (# of words in a document); IDF = log((# of documents) / (# of documents with term)); TF-IDF = TF × IDF. Example TF-IDF scores: unique 4.43, bag 4.34, original 2.945, professional 1.336
  45. 45. Cosine similarity
  46. 46. Add cosine_similarity attribute Quick Book Turbo Tax Intuit Corp Intuit built_by built_by cosine_similarity:0.75
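
As a rough illustration of how a cosine_similarity value might be produced, the following Python sketch builds TF-IDF vectors using the TF and IDF definitions from the slide and compares them with cosine similarity. The sample descriptions are invented, tokenization is a plain whitespace split, and the talk's actual pipeline (built on Java libraries) may weight or normalize terms differently.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions, invented for this example.
docs = [
    "quickbooks accounting software for small business",
    "accounting software quickbooks pro edition",
    "kids power fun for girls game",
]
v0, v1, v2 = tf_idf_vectors(docs)
related = cosine_similarity(v0, v1)
unrelated = cosine_similarity(v0, v2)
```

Here the two accounting descriptions score higher against each other than either does against the unrelated game description, which is the behavior the attribute relies on.
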
  47. 47. Putting it all together
  48. 48. What does our graph look like now? Intuit Corp Intuit is_same_as Quick Book Turbo Tax Intuit Corp Intuit built_by built_by distance:2 distance:2 distance:3 distance:3 jaccard:0.6 cosine_similarity:0.75
  49. 49. Aggregating Traversal ● Aggregate all the values into a weighted sum* ● Highest sum was the most likely match Value = cosine_similarity + jaccard + (manufacturer simplest traversal path where distance is <=2 and path length is <=3) *For this talk I used evenly weighted values; in practice the weights need to be calculated
  50. 50. What does our traversal look like? Intuit Corp Intuit Quick Book Turbo Tax Value = cosine_similarity + jaccard + (traversal paths <3)
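
The weighted sum described above could be sketched as follows. How the manufacturer path contributes to the sum is not spelled out on the slides, so this example assumes it adds 1.0 when a path exists with distance <= 2 and length <= 3 and 0.0 otherwise; that rule, the `match_score` name, and the candidate values are all assumptions for illustration.

```python
def match_score(cosine, jaccard, manufacturer_distance, path_length,
                weights=(1.0, 1.0, 1.0)):
    """Evenly weighted sum of the three signals (weights are tunable)."""
    # Assumption: the manufacturer signal is binary, 1.0 if a close-enough
    # path exists and 0.0 otherwise. None means no path was found.
    has_path = (
        manufacturer_distance is not None
        and path_length is not None
        and manufacturer_distance <= 2
        and path_length <= 3
    )
    w_cos, w_jac, w_man = weights
    return w_cos * cosine + w_jac * jaccard + w_man * (1.0 if has_path else 0.0)


# Hypothetical candidate matches for one Amazon product.
candidates = {
    "g1": match_score(0.75, 0.6, manufacturer_distance=2, path_length=3),
    "g2": match_score(0.40, 0.5, manufacturer_distance=4, path_length=3),
}
best = max(candidates, key=candidates.get)  # highest sum is the most likely match
```

The `weights` parameter is where learned or tuned weights would replace the even weighting used in the talk.
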
  51. 51. So how did we do?
  52. 52. ~87% ● Found 1130 of 1300 ● ~1.2% error rate
  53. 53. Where do we go from here?
  54. 54. Clustering/Blocking ● N-squared comparisons are expensive ● Blocking and Clustering limit comparisons to only those likely to match
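
Blocking was out of scope for the talk, but the idea on the slide above is simple to sketch: assign each record a cheap blocking key and compare only records that share a key. The key used here (the first word of the title) and the sample records are hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import product as cartesian


def candidate_pairs(amazon, google, key):
    """Pair up record ids only within blocks that share the same key."""
    blocks = defaultdict(lambda: ([], []))
    for record in amazon:
        blocks[key(record)][0].append(record["id"])
    for record in google:
        blocks[key(record)][1].append(record["id"])
    pairs = []
    for left_ids, right_ids in blocks.values():
        pairs.extend(cartesian(left_ids, right_ids))
    return pairs


def first_word(record):
    """Hypothetical blocking key: the first token of the title."""
    return record["title"].split()[0]


# Made-up sample records for illustration.
amazon = [{"id": "a1", "title": "quickbooks pro 2007"},
          {"id": "a2", "title": "turbo tax deluxe"}]
google = [{"id": "g1", "title": "quickbooks 2007 pro edition"},
          {"id": "g2", "title": "turbo tax 2007"}]
pairs = candidate_pairs(amazon, google, first_word)
```

In this toy case blocking cuts the comparisons from 4 (every Amazon record against every Google record) to 2; a key that is too strict would also drop true matches, so the choice of key matters.
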
  55. 55. Improve NLP/Data Mining Techniques ● Tune algorithms ● Find accurate weighting with Active Learning ● Locality Sensitive Hashing
  56. 56. Toolkits I used? Apache Commons - https://commons.apache.org/ Java String Similarity - https://github.com/tdebatty/java-string-similarity Apache OpenNLP - https://opennlp.apache.org/ Apache Tinkerpop - http://tinkerpop.apache.org/
  57. 57. Thanks, any questions? www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger

Editor's Notes

  • Not an architect that just draws boxes and lines, I get my hands dirty by actually helping to build these things
  • What this means is resolving data from one or more datasets into a canonical representation of that entity.

    E.g. I have facebook, linkedin, google, twitter etc but there is only one singular entity that is me. Entity resolution is the process of taking each of those disparate data sources and linking them to the singular real world me entity.

    Entity Resolution is not a new problem; it's one that has become more important as we accumulate more and more representations of ourselves and want to mine interesting data from them
  • Deduplication, Record Linkage, Data referencing, Canonicalization, Coreference resolution, Merge/purge, Object identification, Entity clustering, Object consolidation, Identity uncertainty, Reference reconciliation
  • It's not hard if you have structured, clean and consistent data, but in reality the data isn't

    Dave versus David

    Misspelled names

    Missing items

    Wife changed name

  • Canonical Examples - Countries of the world (195), Fortune 500 companies
    Non-canonical examples - probably the most common, the canonical list has to be made from the data
    Examples are: people, places, products
  • Not going to talk about dedupe or blocking/clustering

    A little bit on canonicalization but mostly on linking records
  • MDM - Getting master data from multiple systems
    Customers - linking customers from multiple different internal systems (email, chat, phone)
    Rec engines - Linking sales and product data across divisions
    Intrusion detection - linking IP spoofs to the same person
    Fraud - Linking fraudulent transactions on multiple cards to same person
  • Combining the best of Graph techniques with standard data mining and NLP techniques to provide a better outcome
  • Lots of different
    String similarity - The process of comparing two strings and finding out how similar/dissimilar they are
    Named Entity Recognition - Process of classifying entities in text into predefined categories
    Shingling - process of tokenizing data to gauge similarity
  • Aggregating Traversals - Using traversals to calculate weighted sums
    Pattern Matching - find patterns
    Inferring relationships
    Path traversals
  • g.V().hasLabel("product").
    group().
    by(values('title').fold()).
    unfold().
    filter(select(values).count(local).is(gt(1))).count()
  • You may wonder why we added unique manufacturers from the google data to our graph if we aren’t matching on them
  • NER works by using labelled training set data to determine entities

    Used canonical manufacturers as training set data

    Input the titles
  • Good for comparing shorter string segments like names
  • TF-IDF turns each document into a vector of numbers
    Vectors are then normalized to unit length
    Cosine similarity is the dot product of the normalized vectors
  • Produces a normalized vector of relative importance of words
  • Similar scores are close to 1
    Unrelated scores are close to 0
    Opposites are close to -1
  • Summed up the distance between items with cosine similarity, jaccard index and simplest path traversal where distance<=2 and length<=3
  • Locality Sensitive Hashing - create hash codes for data to find others most like it
  • Apache Commons for cosine-similarity and Jaccard Index
    Java String Similarity for Damerau-Levenshtein
    OpenNLP - for tokenizing and NER
    Tinkerpop for traversals
