Improving Graph Based Entity Resolution with Data Mining and NLP


“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the heart of all developers, but integrating ‘dirty’, unstructured, denormalized, and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.


  1. Improving Graph Based Entity Resolution Using Data Mining and NLP
  2. Hello, I’m David Bechberger
     Architect and Developer
     ● Distributed systems
     ● High-performance, low-latency big data platforms
     ● Graph Databases
     ● Teach and Mentor fellow developers
     www.bechbergerconsulting.com
     www.bechberger.com
     @bechbd
     www.linkedin.com/in/davebechberger
  3. Entity Resolution
  4. What is Entity Resolution
     The process of linking digital entities in data to real world entities.
  5. I am known by many names but you may call me:
     ● Data referencing
     ● Record Linkage
     ● Canonicalization
     ● Coreference resolution
     ● Merge/purge
     ● Entity Clustering
     ● ...
  6. Why is it Hard?
     ● Structured versus Unstructured
     ● Name Ambiguity
     ● Typos/Transposition/Data Errors
     ● Missing/Incomplete Data
     ● Changing Data
     ● Abbreviations
  7. Two types of ER problems
     ● Ones with canonical data
     ● Ones without canonical data
  8. Typical Entity Resolution Steps
     ● Deduplication
     ● Canonicalization/Standardization
     ● Blocking/Clustering
     ● Linking Records
  9. Wait, I thought we were talking about graphs?
  10. Example Graph Entity Resolution Problems
      ● Master Data Management
      ● Linking Customers
      ● Recommendation Engines
      ● Intrusion Detection
      ● Fraud analysis
  11. What are we talking about today?
  12. How can Data Mining/NLP help?
      ● String Similarity
      ● Named Entity Recognition
      ● Shingling
      ● Active/Machine Learning
  13. How can graphs help?
      ● Aggregating Traversals
      ● Pattern Matching
      ● Inferring Relationships
      ● Path
      ● Clustering
  14. Example - Product Catalogs
  15. Problem - Matching Product Data
      ● Product catalog data from Amazon and Google*
      ● Already deduplicated
      ● ~1300 Amazon Products, ~3200 Google Products
      ● Contains a list of perfect matches for testing against
      *Datasets from the Database Group Leipzig, available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
  16. Goal
      Match Amazon data with Google data to build out the basis for a master data management solution
  17. What are we starting with?
      Title | Manufacturer | Description
      clickart 950 000 - premier image pack (dvd-rom) | broderbund | (none)
      ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
      learning quickbooks 2007 | intuit | learning quickbooks 2007
      eu063av aba microsoft windows xp professional | (none) | hp eu063av aba : usually ships in 24 hours...
      [Diagram: Product node (ID, Title, Description, Origin) linked by a built_by edge to a Manufacturer node (Name)]
  18. How are we going to get there?
      1. Bipartite and Pattern Matching
      2. Iteratively add attributes to data
      3. Try and match on weighted attributes
  19. Bipartite/Pattern Matching using Gremlin
  20. Bipartite Graph Matching
      ● Matched on exact titles
      ● Found 216 matches
      [Diagram: product nodes Quick Book and Turbo Tax]
  21. Bipartite Graph Matching
      g.V().hasLabel('product').
        group().by(values('title').fold()).
        unfold().
        filter(select(values).count(local).is(gt(1)))
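The exact-title grouping in the Gremlin traversal above can be sketched outside the graph as well. This is a minimal Python illustration, not the speaker's code: the sample records, their `id` values, and the helper name `exact_title_matches` are assumptions made for the example; only the `title` and `origin` fields come from the slides.

```python
from collections import defaultdict

# Hypothetical sample records; only the "title" and "origin" fields mirror the slides.
products = [
    {"id": "a1", "origin": "amazon", "title": "learning quickbooks 2007"},
    {"id": "g1", "origin": "google", "title": "learning quickbooks 2007"},
    {"id": "a2", "origin": "amazon", "title": "clickart 950 000 - premier image pack (dvd-rom)"},
]


def exact_title_matches(records):
    """Group record ids by title and keep only titles shared by more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[record["title"]].append(record["id"])
    # Mirrors the traversal's filter step: keep groups whose member count is > 1.
    return {title: ids for title, ids in groups.items() if len(ids) > 1}


matches = exact_title_matches(products)
```

Like the traversal, this only catches titles that are identical character for character, which is why the later slides add fuzzy measures.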
  22. Graph Pattern Matching
      ● Matched on manufacturer + fuzzy match on title
      ● Found 354 matches
      [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; two built_by edges]
  23. Graph Pattern Matching
      g.V().hasLabel('product').or(
        group().by(values('manufacturer').fold()).unfold()
          .filter(select(values).count(local).is(gt(1))),
        match(
          __.as('a').has('origin', 'amazon').as('amazon'),
          __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
          __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
        )
      )
  24. Find Canonical Manufacturers
  25. Find Manufacturers in Amazon data
      ● Fuzzy match to find unique manufacturers
      ● Create and link nodes to unique manufacturers
      ● Found 227 manufacturers
      [Diagram: nodes Intuit, Intuit Corp, Quick Book, Intuit Corp; two built_by edges; labels Canonical and Original]
  26. Find Manufacturers in Google data
      ● ~7% had manufacturers (232/3229)
      ● 224 products matched existing manufacturers
      ● Found 8 more unique manufacturers
  27. Validate Canonical Manufacturers
      ● Review and validate canonical data
      ● Add edges between data that represent the same entity
      [Diagram: nodes Sony, Sony Corp, Intuit Corp, Intuit; two is_same_as edges]
  28. Build out the Canonical Manufacturer graph
      ● Found 235 unique manufacturers
      ● 14 aliases
      ● Canonical Manufacturers added to graph with aliases
      [Diagram: nodes Intuit Corp, Intuit, Microsoft, Sony; one is_same_as edge]
  29. What does our graph look like now?
      [Diagram: nodes Intuit, Intuit Corp, Intuit Corp, Quick Book, Intuit, Turbo Tax, Microsoft, Sony; one is_same_as edge; four built_by edges]
  30. Manufacturer Pattern Matching
      ● Added Manufacturer Traversal into Pattern Match
      ● Found 534 matches
      [Diagram: nodes Intuit, Intuit Corp, Intuit Corp, Quick Book, Intuit, Turbo Tax; labels Canonical and Original]
  31. Graph Pattern Matching
      g.V().hasLabel('product').or(
        group().by(values('manufacturer').fold()).unfold()
          .filter(select(values).count(local).is(gt(1))),
        match(
          __.as('a').has('origin', 'amazon').as('amazon'),
          __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
          __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
        ),
        V().repeat(out().hasLabel(within('built_by', 'is_same_as'))).limit(3))
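The effect of walking built_by and is_same_as edges to a canonical manufacturer can be illustrated with a small alias lookup. This is a sketch under assumptions: the `IS_SAME_AS` table stands in for the is_same_as edges, its entries and both function names are hypothetical, and it is not a translation of the traversal above.

```python
# Hypothetical alias table standing in for is_same_as edges (alias -> canonical name).
IS_SAME_AS = {
    "intuit corp": "intuit",
    "sony corp": "sony",
}


def canonical_manufacturer(name):
    """Follow is_same_as links until a name with no further alias is reached."""
    current = name.strip().lower()
    seen = set()
    # The seen set guards against accidental cycles in the alias table.
    while current in IS_SAME_AS and current not in seen:
        seen.add(current)
        current = IS_SAME_AS[current]
    return current


def same_manufacturer(a, b):
    """Two manufacturer strings match when they resolve to the same canonical name."""
    return canonical_manufacturer(a) == canonical_manufacturer(b)
```

With this lookup, a product built_by "Intuit Corp" and one built_by "Intuit" land on the same canonical node, which is the extra signal slide 30 adds to the pattern match.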
  32. ~41%
      ● Found 534 of 1300
  33. Use NLP/Data Mining to add attributes
  34. A quick word on Similarity Measurements
      ● Many different algorithms, each solves a different problem
      ● Know your data
      ● Research the options
      ● Choose the right one for your data
  35. Most Google Data Missing Manufacturer
      Or is it?
      Example: eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english
  36. Named Entity Recognition
      Process of classifying entities in strings into known categories
      microsoft xbox 360: forza motorsport 2
      sony playstation 2: karaoke revolution: american idol bundle
      ibm(r) viavoice(r) advanced edition 10
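To show why titles can stand in for a missing manufacturer field, here is a deliberately simple dictionary lookup over the example titles. It is not a trained named entity recognition model and not how Apache OpenNLP (listed in the toolkits slide) works; the `KNOWN_MANUFACTURERS` set and the function name are assumptions for illustration.

```python
import re

# Hypothetical gazetteer; in the talk's workflow this role is played by the
# canonical manufacturer nodes already in the graph.
KNOWN_MANUFACTURERS = {"microsoft", "sony", "ibm", "intuit"}


def find_manufacturers(title):
    """Return known manufacturer names that appear as tokens in a product title."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [token for token in tokens if token in KNOWN_MANUFACTURERS]


example = find_manufacturers(
    "eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english"
)
```

A lookup like this only finds names it already knows; a statistical NER model is what lets previously unseen manufacturers be recognized.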
  37. Damerau-Levenshtein Distance
      ● Measures the edit distance between two strings
      ● Handles insertions, deletions, transpositions and substitutions
      [Diagram: Sony, Snoy, Snyo with distances 1, 2, 2]
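A compact way to see how transpositions change the count is the optimal string alignment variant of Damerau-Levenshtein distance, sketched below. This is a plain Python illustration rather than the Java library call used for the talk, and the function name `osa_distance` is an assumption.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and adjacent transpositions,
    each with a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "on" <-> "no".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For example, `osa_distance("sony", "snoy")` is 1 because swapping the adjacent letters counts as a single edit, where plain Levenshtein distance would count 2.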
  38. Add distance attribute
      [Diagram: nodes Intuit, Intuit Corp, Intuit, Quick Book; built_by edge; Canonical label; edge attributes distance:2 and distance:3]
  39. Find similarity between titles
      Amazon Title | Google Title
      ms visual studio 2011 plus | video studio 11 plus
      Spiderman 3 ps2 | activision 81935 spiderman 3 ps2
      kids power fun for girls | Topic entertainment kids power fun for girls
  40. Jaccard Index
      ● Set similarity measure between finite sets (A, B)
      ● Works on n-grams
      ● Calculated as intersection over union: J(A,B) = |A∩B| / |A∪B|
      N | Name | Input | n-grams
      1 | Unigram | This is a sentence | this, is, a, sentence
      2 | Bigram | This is a sentence | this is, is a, a sentence
      3 | Trigram | This is a sentence | this is a, is a sentence
  41. Jaccard Index
      A = Dragon Natural Speaking 9.0
      B = Dragon Natural 9.0 Professional
      |A ∪ B| = 5
      |A ∩ B| = 3
      Jaccard Index = 3/5 = 0.60
      [Venn diagram of A and B over the tokens Dragon, Natural, Speaking, 9.0, Professional]
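The Dragon example can be reproduced with a few lines of Python. This is a minimal sketch assuming whitespace tokenization and lowercasing; the helper names `word_ngrams` and `jaccard_index` are illustrative, not from the talk.

```python
def word_ngrams(text, n=1):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_index(a, b, n=1):
    """Jaccard index |A ∩ B| / |A ∪ B| over the word n-gram sets of two strings."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings empty: define similarity as 0
    return len(set_a & set_b) / len(union)


score = jaccard_index("Dragon Natural Speaking 9.0", "Dragon Natural 9.0 Professional")
```

With unigrams the two titles share three of five distinct tokens, giving 0.6 as on the slide. Raising `n` makes the measure sensitive to word order, at the cost of fewer overlapping n-grams on short titles.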
  42. Add jaccard attribute
      [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; two built_by edges; edge attribute jaccard:0.6]
  43. Find similarity between descriptions
      ● TF-IDF finds the relative importance of words in a document
      ● Cosine similarity compares two vectors and gives the similarity between them
  44. TF-IDF
      TF = (# of times a word appears) / (# of words in the document)
      IDF = log( (# of documents) / (# of documents with the term) )
      Word | TF-IDF Score
      unique | 4.43
      bag | 4.34
      original | 2.945
      professional | 1.336
  45. Cosine similarity
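The two description measures can be combined in a short Python sketch. It uses TF as term count over document length and IDF as log(documents / documents containing the term), matching the definitions on the TF-IDF slide; the whitespace tokenizer, the sparse-dictionary vectors, and the function names are assumptions, and the sample descriptions are made up for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions for illustration only.
docs = [
    "learning quickbooks 2007",
    "learning quickbooks 2007 tutorial",
    "sony playstation 2 karaoke",
]
vectors = tf_idf_vectors(docs)
```

Terms that appear in every document get an IDF of log(1) = 0 and drop out of the comparison, which is the point of the weighting: common words stop dominating the similarity.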
  46. Add cosine_similarity attribute
      [Diagram: product nodes Quick Book and Turbo Tax; manufacturer nodes Intuit Corp and Intuit; two built_by edges; edge attribute cosine_similarity:0.75]
  47. Putting it all together
  48. What does our graph look like now?
      [Diagram: manufacturer nodes Intuit Corp and Intuit joined by is_same_as; product nodes Quick Book and Turbo Tax; two built_by edges; attributes distance:2, distance:2, distance:3, distance:3, jaccard:0.6, cosine_similarity:0.75]
  49. Aggregating Traversal
      ● Aggregate all the values into a weighted sum*
      ● Highest sum was the most likely match
      Value = cosine_similarity + jaccard + (manufacturer simplest traversal path where distance is <= 2 and path length is <= 3)
      *For this talk I used evenly weighted values; in practice the weights need to be calculated
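The weighted sum can be sketched as a small scoring function. Several things here are assumptions rather than the speaker's implementation: the manufacturer path term is modelled as 1.0 when a qualifying path exists and 0.0 otherwise, the weights default to even values as the footnote describes, and the candidate ids and numbers are made up.

```python
def match_score(cosine_similarity, jaccard, manufacturer_path_found,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three signals; even weights by default."""
    w_cosine, w_jaccard, w_path = weights
    # Assumption: the path term contributes 1.0 if a manufacturer path with
    # distance <= 2 and length <= 3 was found, otherwise 0.0.
    path_value = 1.0 if manufacturer_path_found else 0.0
    return w_cosine * cosine_similarity + w_jaccard * jaccard + w_path * path_value


def best_match(candidates):
    """Pick the candidate id with the highest score.

    candidates maps a candidate id to (cosine_similarity, jaccard, path_found).
    """
    return max(candidates, key=lambda cid: match_score(*candidates[cid]))


# Hypothetical Google candidates for a single Amazon product.
candidates = {
    "g1": (0.75, 0.6, True),
    "g2": (0.80, 0.2, False),
}
winner = best_match(candidates)
```

Keeping the weights as a parameter leaves room for the tuning the footnote calls for, for example by fitting them against the dataset's list of known perfect matches.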
  50. What does our traversal look like?
      [Diagram: manufacturer nodes Intuit Corp and Intuit; product nodes Quick Book and Turbo Tax]
      Value = cosine_similarity + jaccard + (traversal paths <3)
  51. So how did we do?
  52. ~87%
      ● Found 1130 of 1300
      ● ~1.2% error rate
  53. Where do we go from here?
  54. Clustering/Blocking
      ● N-squared comparisons are expensive
      ● Blocking and Clustering limit comparisons to only those likely to match
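Blocking can be illustrated by grouping titles under a cheap key and only comparing within each group. The key used here (the first word of the title) is a hypothetical choice made to keep the sketch short; it is not a recommendation from the talk, and the sample titles are made up.

```python
from collections import defaultdict
from itertools import product


def blocking_key(title):
    """Hypothetical blocking key: the first lowercase token of the title."""
    tokens = title.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(amazon_titles, google_titles):
    """Return only the (amazon, google) pairs that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for title in amazon_titles:
        blocks[blocking_key(title)][0].append(title)
    for title in google_titles:
        blocks[blocking_key(title)][1].append(title)
    pairs = []
    for amazon_block, google_block in blocks.values():
        pairs.extend(product(amazon_block, google_block))
    return pairs


amazon = ["learning quickbooks 2007", "sony playstation 2"]
google = ["learning quickbooks 2007 tutorial", "microsoft xbox 360"]
pairs = candidate_pairs(amazon, google)
```

Here one pair is compared instead of all four. The trade-off is recall: a first-token key would separate "Spiderman 3 ps2" from "activision 81935 spiderman 3 ps2", so a real key has to be chosen with the data in mind.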
  55. Improve NLP/Data Mining Techniques
      ● Tune algorithms
      ● Find accurate weighting with Active Learning
      ● Locality Sensitive Hashing
  56. Toolkits I used
      Apache Commons - https://commons.apache.org/
      Java String Similarity - https://github.com/tdebatty/java-string-similarity
      Apache OpenNLP - https://opennlp.apache.org/
      Apache TinkerPop - http://tinkerpop.apache.org/
  57. Thanks, any questions?
      www.bechbergerconsulting.com
      www.bechberger.com
      @bechbd
      www.linkedin.com/in/davebechberger