Improving Graph Based Entity Resolution Using Data Mining and NLP
Hello, I’m David Bechberger
Architect and Developer
● Distributed systems
● High performance, low latency big data platforms
● Graph Databases
● Teach and mentor fellow developers
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
Entity Resolution
What is Entity Resolution
The process of linking digital entities in data to real world entities.
I am known by many names but you may call me:
● Data referencing
● Record Linkage
● Canonicalization
● Coreference resolution
● Merge/purge
● Entity Clustering
● ….
Why is it Hard?
● Structured versus Unstructured
● Name Ambiguity
● Typos/Transposition/Data Errors
● Missing/Incomplete Data
● Changing Data
● Abbreviations
Two types of ER problems
● Ones with canonical data
● Ones without canonical data
Typical Entity Resolution Steps
● Deduplication
● Canonicalization/Standardization
● Blocking/Clustering
● Linking Records
Wait, I thought we were talking about graphs?
Example Graph Entity Resolution Problems
● Master Data Management
● Linking Customers
● Recommendation Engines
● Intrusion Detection
● Fraud analysis
What are we talking about today?
How can Data Mining/NLP help?
● String Similarity
● Named Entity Recognition
● Shingling
● Active/Machine Learning
How can graphs help?
● Aggregating Traversals
● Pattern Matching
● Inferring Relationships
● Path Traversals
● Clustering
Example - Product Catalogs
Problem - Matching Product Data
● Product catalog data from Amazon and Google*
● Already deduplicated
● ~1300 Amazon Products, ~3200 Google Products
● Contains a list of perfect matches for testing against
*Datasets from the Database Leipzig Group, available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
Goal
Match Amazon data with Google data to build out the basis for a master data management solution
What are we starting with?
Title | Manufacturer | Description
clickart 950 000 - premier image pack (dvd-rom) | broderbund | (blank)
ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
learning quickbooks 2007 | intuit | learning quickbooks 2007
eu063av aba microsoft windows xp professional | (blank) | hp eu063av aba : usually ships in 24 hours...
[Figure: graph schema. Product (ID, Title, Description, Origin) --built_by--> Manufacturer (Name)]
How are we going to get there?
1. Bipartite and Pattern Matching
2. Iteratively add attributes to data
3. Try and match on weighted attributes
Bipartite/Pattern Matching using Gremlin
Bipartite Graph Matching
● Matched on exact titles
● Found 216 matches
[Figure: two product nodes, "Quick Book" and "Turbo Tax"]
Bipartite Graph Matching
g.V().hasLabel("product").
  group().by(values('title').fold()).
  unfold().
  filter(select(values).count(local).is(gt(1)))
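The same exact-title grouping can be sketched outside the graph. This is a minimal Python illustration, not the talk's implementation; the sample records and the field names `id`, `origin` and `title` are assumptions that mirror the schema on the earlier slide.

```python
from collections import defaultdict

# Hypothetical sample records; the real catalogs hold ~1300 and ~3200 products.
products = [
    {"id": "a1", "origin": "amazon", "title": "learning quickbooks 2007"},
    {"id": "g1", "origin": "google", "title": "learning quickbooks 2007"},
    {"id": "a2", "origin": "amazon", "title": "turbotax deluxe 2007"},
]


def exact_title_matches(records):
    """Group records by title and keep only titles shared by more than one record."""
    by_title = defaultdict(list)
    for record in records:
        by_title[record["title"]].append(record["id"])
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}


matches = exact_title_matches(products)
print(matches)
```

Because each source is already deduplicated, any title that appears more than once should pair an Amazon product with a Google product.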
Graph Pattern Matching
● Matched on manufacturer + fuzzy match on title
● Found 354 matches
[Figure: products "Quick Book" and "Turbo Tax" with built_by edges to manufacturer nodes "Intuit Corp" and "Intuit"]
Graph Pattern Matching
g.V().hasLabel('product').or(
  group().by(values('manufacturer').fold()).unfold()
    .filter(select(values).count(local).is(gt(1))),
  match(
    __.as('a').has('origin', 'amazon').as('amazon'),
    __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
    __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
  )
)
Find Canonical Manufacturers
Find Manufacturers in Amazon data
● Fuzzy match to find unique manufacturers
● Create and link nodes to unique manufacturers
● Found 227 manufacturers
[Figure: product "Quick Book" with built_by edges to an original manufacturer node and a canonical manufacturer node ("Intuit Corp", "Intuit")]
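One way to picture the fuzzy grouping of manufacturer names is the sketch below. It is an assumption-laden illustration: it uses the standard library's `difflib.SequenceMatcher` ratio as the similarity measure and a hypothetical threshold of 0.8, neither of which comes from the talk.

```python
from difflib import SequenceMatcher


def canonicalize(names, threshold=0.8):
    """Map each raw name to the first previously seen name it closely resembles.

    The 0.8 threshold is an illustrative assumption, not a tuned value.
    """
    canonical = []  # normalized representatives, in first-seen order
    mapping = {}
    for name in names:
        key = name.strip().lower()
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(key)
            match = key
        mapping[name] = match
    return mapping


mapping = canonicalize(["Intuit", "intuit ", "Intuitt", "Sony"])
print(mapping)
```

Names that differ by more than a typo, such as a trailing "Corp", fall below this threshold and still need the review step described on the next slide.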
Find Manufacturers in Google data
● ~7% had manufacturers (232/3229)
● 224 products matched existing manufacturers
● Found 8 more unique manufacturers
Validate Canonical Manufacturers
● Review and validate canonical data
● Add edges between data that represent the same entity
[Figure: manufacturer nodes "Sony" and "Sony Corp", and "Intuit Corp" and "Intuit", each pair joined by an is_same_as edge]
Build out the Canonical Manufacturer graph
● Found 235 unique manufacturers
● 14 aliases
● Canonical Manufacturers added to graph with aliases
[Figure: manufacturer nodes "Intuit Corp" and "Intuit" joined by is_same_as, plus "Microsoft" and "Sony"]
What does our graph look like now?
[Figure: products "Quick Book" and "Turbo Tax" with built_by edges to manufacturer nodes "Intuit" and "Intuit Corp", which are joined by is_same_as; "Microsoft" and "Sony" manufacturer nodes also shown]
Manufacturer Pattern Matching
● Added Manufacturer Traversal into Pattern Match
● Found 534 matches
[Figure: products "Quick Book" and "Turbo Tax" connected through original and canonical manufacturer nodes ("Intuit Corp", "Intuit")]
Graph Pattern matching
g.V().hasLabel('product').or(
  group().by(values('manufacturer').fold()).unfold()
    .filter(select(values).count(local).is(gt(1))),
  match(
    __.as('a').has('origin', 'amazon').as('amazon'),
    __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
    __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
  ),
  V().repeat(out().hasLabel(within('built_by', 'is_same_as'))).limit(3)
)
~41%
● Found 534 of 1300
Use NLP/Data Mining to add attributes
A quick word on Similarity Measurements
● Many different algorithms, each solves a different problem
● Know your data
● Research the options and choose the right one for your data
Most Google Data Missing Manufacturer
Or is it?
Example:
eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english
Named Entity Recognition
Process of classifying entities in strings into known categories
microsoft xbox 360: forza motorsport 2
sony playstation 2: karaoke revolution: american idol bundle
ibm(r) viavoice(r) advanced edition 10
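To make the idea concrete, here is a deliberately simple stand-in: a dictionary lookup of known manufacturer names inside a title. It is not a trained NER model of the kind OpenNLP provides; the function name and the cleanup rules are assumptions for illustration.

```python
def find_manufacturer(title, known_manufacturers):
    """Return the first token of the title that is a known manufacturer, else None.

    This is a gazetteer lookup, a much simpler technique than statistical NER.
    """
    # Strip the "(r)" marks and colons seen in the sample titles (assumed cleanup).
    cleaned = title.lower().replace("(r)", " ").replace(":", " ")
    for token in cleaned.split():
        if token in known_manufacturers:
            return token
    return None


known = {"microsoft", "sony", "ibm"}
titles = [
    "microsoft xbox 360: forza motorsport 2",
    "sony playstation 2: karaoke revolution: american idol bundle",
    "ibm(r) viavoice(r) advanced edition 10",
]
found = [find_manufacturer(t, known) for t in titles]
print(found)
```

The canonical manufacturer list built earlier is the natural source for `known`; a trained model would additionally handle names it has never seen.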
Damerau-Levenshtein Distance
● Measures the edit distance between two strings
● Handles insertions, deletions, transpositions and substitutions
[Figure: edit distances between "Sony", "Snoy" and "Snyo", with edge labels 1, 2 and 2]
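A minimal sketch of the distance calculation follows. It implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which counts a swap of two adjacent characters as one edit; the talk used a Java library, so this Python version is only an illustration.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "on" <-> "no"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("sony", "snoy"))  # one transposition
print(osa_distance("sony", "snyo"))
```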
Add distance attribute
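[Figure: product "Quick Book" with built_by edges to manufacturer nodes "Intuit" and "Intuit Corp" and a canonical manufacturer node; edges annotated with distance:2 and distance:3]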
Find similarity between titles
Amazon Title | Google Title
ms visual studio 2011 plus | video studio 11 plus
Spiderman 3 ps2 | activision 81935 spiderman 3 ps2
kids power fun for girls | Topic entertainment kids power fun for girls
Jaccard Index
● Set similarity measure between finite sets (A, B)
● Works on n-Grams
● Calculated as Intersection over Union
J(A,B) = |A ∩ B| / |A ∪ B|
N=1 (Unigram): "This is a sentence" -> this, is, a, sentence
N=2 (Bigram): "This is a sentence" -> this is, is a, a sentence
N=3 (Trigram): "This is a sentence" -> this is a, is a sentence
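The n-gram breakdown above can be reproduced with a few lines of Python. The function name `word_ngrams` and the whitespace tokenization are assumptions for illustration.

```python
def word_ngrams(text, n):
    """Return the word-level n-grams of a string, lowercased, in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for n in (1, 2, 3):
    print(n, word_ngrams("This is a sentence", n))
```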
A = Dragon Natural Speaking 9.0
B = Dragon Natural 9.0 Professional
|A ∪ B| = 5
|A ∩ B| = 3
Jaccard Index = 3/5 = 0.60
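The worked example can be checked with a short sketch over unigram token sets. Returning 0.0 when both inputs are empty is an assumed convention, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard index of the lowercased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty strings
    return len(set_a & set_b) / len(union)


score = jaccard("Dragon Natural Speaking 9.0", "Dragon Natural 9.0 Professional")
print(score)
```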
Jaccard Index
[Figure: Venn diagram of token sets A and B. Shared: Dragon, Natural, 9.0; only in A: Speaking; only in B: Professional]
Add jaccard attribute
[Figure: products "Quick Book" and "Turbo Tax" with built_by edges to manufacturer nodes "Intuit Corp" and "Intuit"; attribute jaccard:0.6 shown]
Find similarity between descriptions
● TF-IDF finds the relative importance of words in a document
● Cosine similarity compares two vectors and gives the similarity between them
TF = (# of times a word appears) / (# of words in a document)
IDF = log( (# of documents) / (# of documents with term) )
TF-IDF
Word | TF-IDF Score
unique | 4.43
bag | 4.34
original | 2.945
professional | 1.336
Cosine similarity
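The TF and IDF definitions from the previous slide, combined with cosine similarity, can be sketched as follows. This is a bare-bones illustration using whitespace tokens and sparse dictionaries; the sample descriptions are made up, and the talk itself used Apache Commons for this step.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions for illustration only
docs = [
    "learning quickbooks 2007",
    "learning quickbooks 2007 intuit",
    "windows xp professional",
]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))
print(cosine_similarity(vecs[0], vecs[2]))
```

Note that a term appearing in every document gets an IDF of log(1) = 0, so it contributes nothing to the comparison.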
Add cosine_similarity attribute
[Figure: products "Quick Book" and "Turbo Tax" with built_by edges to manufacturer nodes "Intuit Corp" and "Intuit"; attribute cosine_similarity:0.75 shown]
Putting it all together
What does our graph look like now?
[Figure: products "Quick Book" and "Turbo Tax" with built_by edges to manufacturer nodes "Intuit Corp" and "Intuit", joined by is_same_as; attributes shown: distance:2, distance:3, jaccard:0.6, cosine_similarity:0.75]
Aggregating Traversal
● Aggregate all the values into a weighted sum*
● The highest sum was the most likely match
Value = cosine_similarity + jaccard + (manufacturer simplest traversal path where distance is <=2 and path length is <=3)
*For this talk I used evenly weighted values; in practice the weights need to be calculated
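The scoring above can be sketched as a small function. How the manufacturer path term enters the sum is not spelled out on the slide, so treating it as a 0/1 indicator is an assumption, as are the function names and the candidate values.

```python
def match_score(cosine_similarity, jaccard, manufacturer_path_found,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the similarity attributes for one candidate pair.

    Even weights mirror the talk; the 0/1 path indicator is an assumption.
    """
    w_cos, w_jac, w_path = weights
    path_term = 1.0 if manufacturer_path_found else 0.0
    return w_cos * cosine_similarity + w_jac * jaccard + w_path * path_term


def best_match(candidates):
    """Pick the candidate with the highest score."""
    return max(
        candidates,
        key=lambda c: match_score(
            c["cosine_similarity"], c["jaccard"], c["manufacturer_path_found"]
        ),
    )


# Hypothetical candidates for a single Amazon product
candidates = [
    {"id": "g1", "cosine_similarity": 0.75, "jaccard": 0.6, "manufacturer_path_found": True},
    {"id": "g2", "cosine_similarity": 0.90, "jaccard": 0.7, "manufacturer_path_found": False},
]
winner = best_match(candidates)
print(winner["id"])
```

Keeping the weights as a parameter leaves room for the tuning mentioned in the footnote.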
What does our traversal look like?
[Figure: traversal between products "Quick Book" and "Turbo Tax" through manufacturer nodes "Intuit Corp" and "Intuit"]
Value = cosine_similarity + jaccard + (traversal paths <3)
So how did we do?
~87%
● Found 1130 of 1300
● ~1.2% error rate
Where do we go from here?
Clustering/Blocking
● N-squared comparisons are expensive
● Blocking and Clustering limit comparisons to only those likely to match
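A minimal blocking sketch: only titles that share a blocking key are compared. Using the first word of the title as the key is an assumption chosen for brevity; a poor key will hide true matches, so the choice matters in practice.

```python
from collections import defaultdict


def blocking_key(title):
    """Hypothetical blocking key: the first word of the title."""
    tokens = title.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(amazon_titles, google_titles):
    """Pair up only the titles that fall into the same block."""
    blocks = defaultdict(list)
    for title in google_titles:
        blocks[blocking_key(title)].append(title)
    pairs = []
    for amazon_title in amazon_titles:
        for google_title in blocks.get(blocking_key(amazon_title), []):
            pairs.append((amazon_title, google_title))
    return pairs


amazon = ["learning quickbooks 2007", "kids power fun for girls"]
google = [
    "learning quickbooks 2007 win",
    "microsoft windows xp professional",
    "kids power fun",
]
pairs = candidate_pairs(amazon, google)
print(len(pairs), "pairs instead of", len(amazon) * len(google))
```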
Improve NLP/Data Mining Techniques
● Tune algorithms
● Find accurate weighting with Active Learning
● Locality Sensitive Hashing
Toolkits I used?
Apache Commons - https://commons.apache.org/
Java String Similarity - https://github.com/tdebatty/java-string-similarity
Apache OpenNLP - https://opennlp.apache.org/
Apache TinkerPop - http://tinkerpop.apache.org/
Thanks, any questions?
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger


Editor's Notes

  1. Test text for sizing
  2. Not an architect that just draws boxes and lines, I get my hands dirty by actually helping to build these things
  3. What this means is resolving data from one or more datasets into a canonical representation of that entity. E.g. I have Facebook, LinkedIn, Google, Twitter, etc., but there is only one singular entity that is me. Entity resolution is the process of taking each of those disparate data sources and linking them to the singular real-world "me" entity. Entity resolution is not a new problem; it is one that has become more important as we get more and more representations of ourselves and want to mine interesting data from them.
  4. Deduplication, Record Linkage, Data referencing, Canonicalization, Coreference resolution, Merge/purge, Object identification, Entity clustering, Object consolidation, Identity uncertainty, Reference reconciliation
  5. It's not hard if you have structured, clean and consistent data, but in reality you don't: Dave versus David, misspelled names, missing items, wife changed name.
  6. Canonical examples - countries of the world (195), Fortune 500 companies. Non-canonical examples - probably the most common; the canonical list has to be made from the data. Examples are: people, places, products.
  7. Not going to talk about dedupe or blocking/clustering. A little bit on canonicalization, but mostly on linking records.
  8. MDM - getting master data from multiple systems. Customers - linking customers from multiple different internal systems (email, chat, phone). Rec engines - linking sales and product data across divisions. Intrusion detection - linking IP spoofs to the same person. Fraud - linking fraudulent transactions on multiple cards to the same person.
  9. Combining the best of Graph techniques with standard data mining and NLP techniques to provide a better outcome
  10. Lots of different options. String similarity - the process of comparing two strings and finding out how similar/dissimilar they are. Named Entity Recognition - the process of classifying entities in text into predefined categories. Shingling - the process of tokenizing data to gauge similarity.
  11. Aggregating traversals - using traversals to calculate weighted sums. Pattern matching - finding patterns. Inferring relationships. Path traversals.
  12. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  13. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  14. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  15. You may wonder why we added unique manufacturers from the google data to our graph if we aren’t matching on them
  16. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  17. NER works by using labelled training set data to determine entities. Used canonical manufacturers as training set data. Input the titles.
  18. Good for comparing shorter string segments like names.
  19. TF-IDF turns each document into a vector of numbers. Values are then normalized using the dot product. Cosine similarity compares the normalized vectors.
  20. Produces a normalized vector of relative importance of words.
  21. Similar scores are close to 1. Unrelated scores are close to 0. Opposites are close to -1.
  22. Summed up the distance between items with cosine similarity, jaccard index and simplest path traversal where distance<=2 and length<=3
  23. Locality Sensitive Hashing - create hash codes for data to find others most like it
  24. Apache Commons for cosine similarity and Jaccard Index. Java String Similarity for Damerau-Levenshtein. OpenNLP for tokenizing and NER. TinkerPop for traversals.