Real Time Fuzzy Matching With Spark and ElasticSearch
BFSI
Wilful Defaulters?
Sanctions Screening
PEP
HMT
OFAC SDN
..and many others
However ...
7TH OF TIR
7TH OF TIR COMPLEX
7TH OF TIR INDUSTRIAL COMPLEX
7TH OF TIR INDUSTRIES
7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN
SEVENTH of Tir
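The name variants above show why exact string comparison is not enough for screening. A minimal sketch, assuming a simple token-overlap (Jaccard) score as the similarity measure; the slides do not say which measure is actually used, and the helper names `tokens` and `jaccard` are illustrative only:

```python
import re

def tokens(name):
    """Lower-case a name and split it into a set of word tokens."""
    return set(re.findall(r"\w+", name.lower()))

def jaccard(a, b):
    """Share of tokens the two names have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

base = "7TH OF TIR"
variants = [
    "7TH OF TIR COMPLEX",
    "7TH OF TIR INDUSTRIAL COMPLEX",
    "7TH OF TIR INDUSTRIES",
    "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN",
    "SEVENTH of Tir",
]

for variant in variants:
    # Exact equality is False for every variant; the overlap score is graded.
    print(f"{jaccard(base, variant):.2f}  exact={base == variant}  {variant}")
```

Note that token overlap still scores "SEVENTH" and "7TH" as unrelated words, which is one reason hand-written rules become hard to maintain.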
Entity Resolution
Directory Listings
De
Dew Drops, Shop no - A-152, super mart 1, Gurgaon - 122001, DLF Phase 4
DewDrop Florist, A 152, DLF City Phase 4, Near Galleria Market, Super Mart 1
Ecommerce
Cherry Mobile Amethyst Android 4.2 Jelly Bean (Black) with Free Smart and Globe SIM
Cherry Mobile Amethyst (White) with 1 Smart SIM
CHERRY MOBILE AMETHYST + 1 SMART SIM
Cherry Mobile Amethyst Android 4.2 Jelly Bean
Cherry Mobile Amethyst (White) with 1 Samsung Galaxy V
CHERRY MOBILE AMETHYST + 1 SAMSUNG GALAXY V. + 1 SMART AND GLOBE SIM
Government of ..
● Benefit rollouts
● Surveillance
● Licenses
● Linking NPR with Passport
360 view
ID     Company   Name            Project
12345  UBM Asia  Dave Chan       HK - Fine Jewellery
13222  UBM A     Dave C          HK - Fashion Jewellery
15656  UBM       Davechan        HK - Beauty
14456  ubmAsia   Mr. Dave CChan  HK - Fine Jewellery
“In order to be irreplaceable, one must always
be different.”
― Coco Chanel
Other uses
● Cross selling
● Data Quality
● Vendor consolidation
● Master Data Management
● CRM Deduplication
Challenges
● Discovering and maintaining rules is extremely tough
● Custom coding and domain-specific logic make maintenance a nightmare
● No one size fits all: big custom implementations are needed every time, even with existing tools
Challenges..
● High Data volumes
● Each record has multiple dimensions
● Exact matches are rare
● Comparing each record with every other record is not feasible at this volume
● Languages have unique issues
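To make the all-pairs problem concrete, here is a rough sketch of how quickly the number of comparisons grows, and of how grouping records by a cheap key (often called blocking) cuts it down. The key used here, the first three characters of the name, is purely an assumption for illustration and is not taken from the slides:

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_count(n):
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2

def blocked_pairs(records, block_key):
    """Only pair up records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

def first_three(name):
    # Hypothetical blocking key: first three characters, ignoring case and spaces.
    return name.lower().replace(" ", "")[:3]

records = [
    "UBM Asia", "ubmAsia", "UBM A",
    "Dew Drops", "DewDrop Florist",
    "Cherry Mobile Amethyst",
]

print(all_pairs_count(len(records)))             # comparisons without blocking
print(len(blocked_pairs(records, first_three)))  # comparisons with blocking
print(all_pairs_count(1_000_000))                # all pairs for a million records
```

A crude key like this will miss matches whose first characters differ, so the choice of key trades recall against the amount of work saved.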
Let's start wishing...
● Data variety
● Scalable
● No manual configuration of rules or algorithms
● Multi language
● Real time
Our Approach
- Learn from the data
- Divide the load
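As a toy illustration of "learn from the data", the sketch below picks a match threshold from a handful of labelled pairs instead of hard-coding one. The slides do not describe Reifier's actual learning algorithm; the function `best_threshold` and the sample scores are hypothetical:

```python
def best_threshold(labelled):
    """Return the similarity cut-off that misclassifies the fewest labelled pairs.

    labelled: list of (similarity, is_match) tuples.
    """
    candidates = sorted({score for score, _ in labelled})

    def errors(threshold):
        # A pair is predicted to match when its score reaches the threshold.
        return sum((score >= threshold) != is_match for score, is_match in labelled)

    return min(candidates, key=errors)

# Hypothetical training pairs: similarity score and a human yes/no label.
labelled_pairs = [
    (0.90, True),
    (0.75, True),
    (0.50, True),
    (0.40, False),
    (0.10, False),
]

print(best_threshold(labelled_pairs))
```

With more fields per record the same idea extends to learning weights per field, which is where a machine learning library becomes useful.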
Reifier Workflow
[Flowchart: Configure data; Have training data? (Yes / No); Reifier Interactive Learner; Reifier Match; Linked Result]
1. Select Data
2. Field Selection and Stop Words
Strata Hadoop World Singapore 2015
3. Choose Training Set
4. Run the Spark Job
5. Enjoy the results
Fuzzy Match in Reifier – Stop Words
At the beginning (without Chinese stop words):
亚洲博闻有限公司 Dave Chan
亚洲华乐有限公司 David Chan
In this case, the similarity between the 2 records is very high, because both company names share 亚洲 ("Asia") and 有限公司 ("Co., Ltd.").
What if we include the stop words (亚洲, 有限公司)?
博闻 Dave Chan
华乐 David Chan
The company names of these records now do not match at all, and the system
will not group them together.
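The stop-word effect can be reproduced with a few lines of code. This is a minimal sketch that assumes character-level overlap (Jaccard) as the similarity measure; the slides do not state which measure Reifier uses, and `strip_stop_words` and `char_jaccard` are illustrative names:

```python
def strip_stop_words(text, stop_words):
    """Remove every configured stop word from the text."""
    for word in stop_words:
        text = text.replace(word, "")
    return text

def char_jaccard(a, b):
    """Share of distinct characters the two strings have in common."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

company_a = "亚洲博闻有限公司"
company_b = "亚洲华乐有限公司"
stop_words = ["亚洲", "有限公司"]

# Without stop words the shared boilerplate dominates the score.
print(char_jaccard(company_a, company_b))

# With stop words removed only the distinctive parts are compared.
print(char_jaccard(strip_stop_words(company_a, stop_words),
                   strip_stop_words(company_b, stop_words)))
```

The first score is high only because of the shared prefix and suffix; once those are stripped, the distinctive parts 博闻 and 华乐 have nothing in common.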
Reifier Interactive Learner
Spark Benefits
● Distributed
● Scalable
● Fast
● Machine Learning
● Sampling
● No need to orchestrate multiple jobs
Real Time
Spark + ElasticSearch
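The slides do not show how the two systems are wired together. One plausible reading is that Elasticsearch serves the low-latency lookup of candidate records. The sketch below only builds the request body for an Elasticsearch `match` query with `fuzziness` set to `AUTO`; the field name `company_name` and the helper `fuzzy_name_query` are assumptions, not part of the talk:

```python
import json

def fuzzy_name_query(name, field="company_name", size=5):
    """Build a search body that tolerates small spelling differences."""
    return {
        "size": size,  # how many candidate records to return
        "query": {
            "match": {
                field: {
                    "query": name,
                    "fuzziness": "AUTO",  # edit distance chosen from term length
                }
            }
        },
    }

# Hypothetical incoming record with a misspelt product name.
body = fuzzy_name_query("Cherry Mobil Amethyst", field="product_title")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint, and the returned candidates could then be scored by whatever matching model was trained in the batch job.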
Advantages
● Point and Shoot - Zero config
● Learning similarity definitions from data
■ No hard coding of business rules
■ Domain agnostic
■ Handle multiple languages (English, Chinese, Japanese, Thai)
Advantages
● Scalability
● Real time as well as batch
Thank You!
www.nubetech.co
sonal@nubetech.co
