Real-Time Entity Resolution with Elasticsearch - Haystack 2018

OpenSource Connections
Dave Moore
david.moore@elastic.co
Real-Time Entity Resolution
With Elasticsearch
1
Disambiguation
"Named Entity Recognition": single attributes in unstructured text
vs.
"Entity Resolution": multiple attributes in structured data
Person
Field Value
Name Alice Jones
DOB 1984-01-01
Street 123 Main St
Credit Card 4040 0000 2020 8080
Phone 202-555-1234
2
What is entity resolution?
Health Care
Patient ID
We need to identify
and their medical
many hand-written
Mixing up records puts
at risk of injury or
Sales & Marketing
Customer Intel
We have reps
managing many
sources of info on
leads and customers.
Our view of the buyer
is fragmented and that
makes us less
effective. We're losing
pipeline.
Security & Compliance
Fraud
We need to track a
person or device that is
hiding its tracks.
Connecting the dots is a
laborious process and
we can't keep up with
our incident backlog.
Military, IC, Law
Surveillance
We need to track a
person or device that is
hiding its identity. Our
timely success is
critical to public safety
and national security.
Privacy Compliance
GDPR
We must find and
manage all PII to
respond to inquiries.
Failure to comply risks
fines of €20 million or
4% annual turnover.
IT
MDM
MDM is a slow and
bureaucratic process.
We can solve our own
data quality problems
faster and better. And
we still need query
time entity resolution.
3
Examples
4
Why is identity hard to track?
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
5
1. Identity is Vague
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Icons by icons8
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
6
2. Identity Changes
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
7
3. Identity is Messy
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
8
4. Identity is Diverse
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
???
???
???
???
9
Entity Resolution
connects the dots despite these challenges
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202 555 1234
Allie Smith 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Smith 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Smith 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Smith 555 Broad St ABC Widgets, Inc 4040 0000 2020 8080 202-555-1234
Allie Smith 555 Broad Street XYZ Tech Corp 3030 5050 9999 0000 202.555.1234
Allie Smith 555 Broad Street XYZ Technology Corp 3030 5050 9999 0000 202-555-9876
10
Comparison to Search
Search: name:"Allie Jones" AND street:"123 Main St"
Resolution: name:"Allie Jones" AND street:"123 Main St"
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-5555
Aly Jonas 113 Main Street Acme Corp. 4716 1035 4536 4671 610-555-5555
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202-555-9876
Al Jones 132 E Main St Mom & Pop, LLC 3772 733741 52501 1-610-555-0000
Aly Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5555
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-1234
Aly Jonas 113 Main Street Acme Corp. 4781 9105 0533 4481 610-555-2345
Allie Johns 132 W Main Street ABC Widgets 4088 0110 2044 8180 202-555-3456
Elle Jeon 132 E Main St Mom & Pop, LLC 3502 730741 52203 1-610-555-4567
Elle Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5678
Eli Jones 132 Mane Street ABC Widgets 4224 0065 4800 1337 888-555-6789
Eli Joans 113 Main Street Acme Corp. 4206 1035 4536 4081 610-555-7890
Allie Jeans 132 N Mean Street ABC Widgets 4240 0101 02020 8888 202-555-8901
Search: Search engine ranks results once. True hits mixed with noise.
Resolution: Search engine filters results recursively. True hits isolated and transitively linked.
11
Real-Time
12
Batch vs. Real-Time
Batch
How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components)
How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents)
When is it necessary? Population or network analysis

Real-Time
How is it used? Resolve one entity on query (recursive Boolean query)
How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each)
When is it necessary? Individual analysis

Most solutions have a real-time phase, sometimes applied after batch resolution.
13
Real-Time
Why Elasticsearch
Robust matching
• Token normalization
• Phonetic matching
• Fuzzy transpositions
• Boolean logic filtering
• Fine-tune search parameters
Suited for operations
• Horizontal scaling
• Real-time response rates
• Flexible index mappings
14
Approach
Goals
• Fast – Get results in real-time. From milliseconds to low seconds.
• Generic – Resolve any type of entity. People, companies, locations, sessions, etc.
• Transitive – Resolve over multiple hops of matches. Capture changing identities.
• Multi-source – Resolve over multiple indices with disparate mappings.
• Accommodating – Operate on data as it exists. Avoid transforming and reindexing data.
• Logical – Logic is easier to read, troubleshoot, and optimize than statistics.
• 100% Elasticsearch – Operate within existing search infrastructure.
15
Approach
Steps
1. Entity modeling – What is the entity? What are its attributes?
2. Analyzers – How are you indexing each attribute?
3. Matchers – What is the query logic for each attribute?
4. Resolvers – What combinations of matching attributes imply a resolution?
5. Metadata maps – Which matchers apply to which indexed fields?
6. Recursive queries – How to repeat the queries until completion?
16
zentity
zentity.io
An open source Elasticsearch plugin
for real-time entity resolution
17
POST _zentity/resolution/person
{
  "attributes": {
    "name": "Alice Jones",
    "dob": "1984-01-01",
    "phone": [ "555-123-4567", "555-987-6543" ]
  }
}
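The request above can be assembled from any HTTP client. A minimal Python sketch follows, assuming a cluster at http://localhost:9200 (a placeholder host, not something the slides state). The helper name build_resolution_request is hypothetical. Only the endpoint path and the request body come from the slide.

```python
import json
import urllib.request


def build_resolution_request(entity_type, attributes, host="http://localhost:9200"):
    """Build (but do not send) a POST request for a resolution job.

    `host` is an assumed placeholder; point it at your own cluster.
    """
    url = f"{host}/_zentity/resolution/{entity_type}"
    body = json.dumps({"attributes": attributes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_resolution_request(
    "person",
    {
        "name": "Alice Jones",
        "dob": "1984-01-01",
        "phone": ["555-123-4567", "555-987-6543"],
    },
)
# To send it against a live cluster: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect before it reaches the cluster.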
18
Demos
19
Demos
Customer intelligence
Gather everything we know about a customer.
Web traffic sessionization
Track a bot that cycles through IP addresses, cookies, and user agent signatures.
Fraud detection
Determine if a health care provider was blacklisted under a different name.
Dave Moore
email: david.moore@elastic.co
zentity: zentity.io
Contact
@elastic
www.elastic.co
Extra Content
22
Approach
23
Step 1. Entity Modeling
Name the entity type.
Person
Define its attributes. Study them in your data sets.
Attribute                 Uniqueness  Consistency  Presence
Name – First Name         Moderate    Moderate     Extreme
Name – Last Name          Moderate    Moderate     High
Address – Street          High        Low          Moderate
Address – City            Low         Moderate     Moderate
Address – Province        Low         High         High
Address – Postal Code     Low         High         High
Address – Country         Low         High         High
Date of Birth             Moderate    High         Moderate
Phone Number              Moderate    Moderate     Moderate
Email Address             High        Extreme      Moderate
IP Address                High        Extreme      Low
Credit Card Number        Extreme     Extreme      Low
Social Security Number    Extreme     High         None
24
This model is independent from your indices.
You can reuse and extend this model as you add or amend indices.
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Phonetic
"Alice Jones" => ["ALAC","JAN"]
Standard
"Alice Jones" => ["ALICE","JONES"]
25
Step 2. Analyzers
Take the attributes. Define their analyzers. Put them in your index mappings.
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "phonetic": {
            "type": "phonetic",
            "encoder": "nysiis"
          }
        },
        "analyzer": {
          "phonetic": {
            "filter": [
              "icu_normalizer",
              "icu_folding",
              "phonetic"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "phonetic"
            }
          }
        }
      }
    }
  }
}
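When several attributes need the same phonetic subfield, the settings and mappings above can be generated instead of written by hand. The Python sketch below makes a few assumptions: the helper names are hypothetical, and "last_name" is an illustrative second field. The "icu_normalizer" and "icu_folding" filters come from the analysis-icu plugin and the "phonetic" filter from the analysis-phonetic plugin, so both must be installed. The "_doc" level follows the slide. Elasticsearch 7 and later expect mappings without a type name.

```python
import json

# Analysis settings copied from the slide: a NYSIIS phonetic filter
# wrapped in a custom analyzer named "phonetic".
ANALYSIS_SETTINGS = {
    "index": {
        "analysis": {
            "filter": {"phonetic": {"type": "phonetic", "encoder": "nysiis"}},
            "analyzer": {
                "phonetic": {
                    "filter": ["icu_normalizer", "icu_folding", "phonetic"],
                    "tokenizer": "standard",
                }
            },
        }
    }
}


def phonetic_text_field():
    """A text field with an extra subfield analyzed phonetically."""
    return {
        "type": "text",
        "fields": {"phonetic": {"type": "text", "analyzer": "phonetic"}},
    }


def build_index_body(phonetic_fields, doc_type="_doc"):
    """Assemble the index creation body for the given field names."""
    properties = {name: phonetic_text_field() for name in phonetic_fields}
    return {
        "settings": ANALYSIS_SETTINGS,
        "mappings": {doc_type: {"properties": properties}},
    }


body = build_index_body(["first_name", "last_name"])
print(json.dumps(body, indent=2))
```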
Person
26
Analyzers are powerful. But they must be defined prior to indexing.
Give careful thought to your analyzers to avoid having to reindex data.
Phonetic
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 0
    }
  }
}
Standard
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 2
    }
  }
}
27
Step 3. Matchers
Take the attributes.
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Define their Boolean query logic. Use templates for variables.
Person
{{ field }} – The field of an index.
{{ value }} – The value of an attribute.
We will replace these at query time.
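One way to picture the query-time substitution is a small rendering function. This Python sketch holds the two matcher templates from the slide as dictionaries and fills in {{ field }} and {{ value }}. The function name render_matcher and the field "first_name.phonetic" are illustrative assumptions, and this is not necessarily how zentity performs the substitution internally.

```python
# Matcher templates from the slide, held as Python dictionaries.
MATCHERS = {
    "phonetic": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 0}}},
    "standard": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 2}}},
}


def render_matcher(template, field, value):
    """Return a copy of the template with both placeholders substituted.

    Placeholders may appear in dictionary keys as well as string values,
    so the structure is walked recursively and the template is left untouched.
    """

    def substitute(node):
        if isinstance(node, dict):
            return {substitute(key): substitute(val) for key, val in node.items()}
        if isinstance(node, list):
            return [substitute(item) for item in node]
        if isinstance(node, str):
            return node.replace("{{ field }}", field).replace("{{ value }}", value)
        return node  # numbers, booleans, None

    return substitute(template)


clause = render_matcher(MATCHERS["phonetic"], "first_name.phonetic", "Alice")
print(clause)
```

Working on the parsed structure rather than on raw JSON text avoids escaping problems when a value contains quotes.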
28
Understand that each matcher will be combined
into one large Boolean query.
29
Step 4. Resolvers
Take the attributes.
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Determine which combinations of matching attributes imply a resolution.
[ Name – First, Name – Last, Address – Street, Address – City, Address – Province ]
[ Name – First, Name – Last, Address – Street, Address – Postal Code ]
[ Name – First, Name – Last, Date of Birth, Address – City, Address – Province ]
[ Name – First, Name – Last, Date of Birth, Address – Postal Code ]
[ Name – First, Name – Last, Phone Number ]
[ Name – First, Name – Last, Email Address ]
[ Name – First, Name – Last, IP Address ]
[ Name – First, Name – Last, Credit Card Number ]
[ Name – First, Name – Last, Social Security Number ]
[ Email Address, Phone Number ]
[ Email Address, IP Address ]
[ Email Address, Credit Card Number ]
[ IP Address, Credit Card Number ]
Person
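Resolvers like these translate naturally into Boolean logic: a document resolves if every attribute of at least one resolver matches. The Python sketch below builds an Elasticsearch bool query on that reading. The function name, the short attribute keys, and the sample clauses are hypothetical, and the exact query zentity generates may differ.

```python
def build_resolution_query(resolvers, clauses_by_attribute):
    """Combine per-attribute clauses into one Boolean query.

    resolvers: list of attribute-name lists (AND within, OR across).
    clauses_by_attribute: attribute name -> list of query clauses,
    one clause per known value of that attribute.
    Returns None when no resolver has values for all of its attributes.
    """
    alternatives = []
    for resolver in resolvers:
        if not all(clauses_by_attribute.get(attr) for attr in resolver):
            continue  # a resolver needs a known value for every attribute
        filters = []
        for attr in resolver:
            clauses = clauses_by_attribute[attr]
            if len(clauses) == 1:
                filters.append(clauses[0])
            else:
                # Any known value of the attribute may match.
                filters.append({"bool": {"should": clauses, "minimum_should_match": 1}})
        alternatives.append({"bool": {"filter": filters}})
    if not alternatives:
        return None
    return {"bool": {"should": alternatives, "minimum_should_match": 1}}


resolvers = [["first_name", "last_name", "phone"], ["email", "phone"]]
clauses = {
    "first_name": [{"match": {"first_name": "Allie"}}],
    "last_name": [{"match": {"last_name": "Jones"}}],
    "phone": [
        {"match": {"phone": "202-555-1234"}},
        {"match": {"phone": "202-555-9876"}},
    ],
}
query = build_resolution_query(resolvers, clauses)
```

Here only the first resolver contributes, because no email value is known yet.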
30
Avoid resolving on a single attribute such as Social Security Number.
Corroboration among multiple attributes helps prevent snowballs.
31
Step 5. Metadata Maps
Take the attributes.
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Map them to the fields of the relevant indices.
users.first_name
users.last_name
users.phone
users.email
customers:fname
customers:lname
customers:tel
customers:email
customers:cc
customers:zip
Person
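A metadata map can be sketched as a plain lookup table. In the Python example below, the pairing of attributes with the users and customers fields is inferred from the field names on the slide, which is an assumption. The attribute keys and helper names are hypothetical.

```python
# attribute -> {index name -> field name}; pairings inferred from field names.
ATTRIBUTE_FIELDS = {
    "first_name": {"users": "first_name", "customers": "fname"},
    "last_name": {"users": "last_name", "customers": "lname"},
    "phone": {"users": "phone", "customers": "tel"},
    "email": {"users": "email", "customers": "email"},
    "credit_card": {"customers": "cc"},
    "postal_code": {"customers": "zip"},
}


def field_for(attribute, index):
    """Return the field holding the attribute in the index, or None."""
    return ATTRIBUTE_FIELDS.get(attribute, {}).get(index)


def usable_resolvers(resolvers, index):
    """Keep only resolvers whose attributes are all mapped in the index."""
    return [
        resolver
        for resolver in resolvers
        if all(field_for(attr, index) for attr in resolver)
    ]


resolvers = [["email", "phone"], ["email", "credit_card"]]
print(usable_resolvers(resolvers, "users"))
print(usable_resolvers(resolvers, "customers"))
```

Because the map is separate from the entity model, adding an index means adding entries here without touching the resolvers.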
32
Step 6. Recursive Queries
With each query, new inputs might be found in different attributes.
Use the metadata map and your resolvers to determine if you can
create new queries for the new inputs.
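The recursion can be illustrated without a cluster. The Python sketch below resolves over an in-memory list of records, using exact equality in place of analyzers and matchers. That is a simplifying assumption. The function name, the max_hops limit, and the email addresses are hypothetical. Each pass plays the role of one query, and values found in newly matched records become inputs for the next pass.

```python
def resolve(records, resolvers, seed, max_hops=10):
    """Return the records transitively linked to the seed attributes.

    records: list of dicts mapping attribute name to value.
    resolvers: list of attribute-name lists; a record matches when every
    attribute of some resolver holds a value that is already known.
    seed: dict mapping attribute name to a set of starting values.
    """
    known = {attr: set(values) for attr, values in seed.items()}
    matched = set()
    for _ in range(max_hops):
        new_hits = []
        for position, record in enumerate(records):
            if position in matched:
                continue
            for resolver in resolvers:
                if all(record.get(attr) in known.get(attr, ()) for attr in resolver):
                    new_hits.append(position)
                    break
        if not new_hits:
            break  # nothing new was found, so the resolution is complete
        for position in new_hits:
            matched.add(position)
            for attr, value in records[position].items():
                if value is not None:
                    known.setdefault(attr, set()).add(value)
    return [records[position] for position in sorted(matched)]


records = [
    {"name": "Allie Jones", "phone": "202-555-1234", "email": "allie@example.com"},
    {"name": "Allie Smith", "phone": "202-555-1234", "email": "allie@example.com"},
    {"name": "Allie Smith", "phone": "202-555-9876", "email": "allie@example.com"},
    {"name": "Ali Jones", "phone": "888-555-5555", "email": "ali@example.org"},
]
resolvers = [["name", "phone"], ["email", "phone"], ["name", "email"]]
seed = {"name": {"Allie Jones"}, "phone": {"202-555-1234"}}
hits = resolve(records, resolvers, seed)
```

With this data the first record matches on name and phone, the second on email and phone once the email is known, and the third on name and email once the new name is known. The fourth record never matches.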
1 of 33

Recommended

Zipline—Airbnb’s Declarative Feature Engineering Framework by
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
3K views42 slides
Cognitive Search: Announcing the smartest enterprise search engine, now with ... by
Cognitive Search: Announcing the smartest enterprise search engine, now with ...Cognitive Search: Announcing the smartest enterprise search engine, now with ...
Cognitive Search: Announcing the smartest enterprise search engine, now with ...Microsoft Tech Community
619 views49 slides
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy... by
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...Databricks
2K views16 slides
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud by
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
1.2K views50 slides
Zestimate Lambda Architecture by
Zestimate Lambda ArchitectureZestimate Lambda Architecture
Zestimate Lambda ArchitectureSteven Hoelscher
3.4K views26 slides
Future of Data and AI in Retail - NRF 2023 by
Future of Data and AI in Retail - NRF 2023Future of Data and AI in Retail - NRF 2023
Future of Data and AI in Retail - NRF 2023Rob Saker
639 views46 slides

More Related Content

What's hot

Data Science At Zillow by
Data Science At ZillowData Science At Zillow
Data Science At ZillowNicholas McClure
45.6K views64 slides
Modern Data Challenges require Modern Graph Technology by
Modern Data Challenges require Modern Graph TechnologyModern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph TechnologyNeo4j
104 views35 slides
Data Engineering Basics by
Data Engineering BasicsData Engineering Basics
Data Engineering BasicsCatherine Kimani
5.6K views25 slides
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v... by
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...Fermin Galan
300 views58 slides
Everything you ever wanted to know about external sharing in Microsoft 365 - ... by
Everything you ever wanted to know about external sharing in Microsoft 365 - ...Everything you ever wanted to know about external sharing in Microsoft 365 - ...
Everything you ever wanted to know about external sharing in Microsoft 365 - ...Chirag Patel
295 views38 slides
Sentiment Analysis on Twitter by
Sentiment Analysis on TwitterSentiment Analysis on Twitter
Sentiment Analysis on TwitterSubarno Pal
817 views37 slides

What's hot(20)

Modern Data Challenges require Modern Graph Technology by Neo4j
Modern Data Challenges require Modern Graph TechnologyModern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph Technology
Neo4j104 views
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v... by Fermin Galan
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...
Orion Context Broker NGSI-v2 Overview for Developers That Already Know NGSI-v...
Fermin Galan300 views
Everything you ever wanted to know about external sharing in Microsoft 365 - ... by Chirag Patel
Everything you ever wanted to know about external sharing in Microsoft 365 - ...Everything you ever wanted to know about external sharing in Microsoft 365 - ...
Everything you ever wanted to know about external sharing in Microsoft 365 - ...
Chirag Patel295 views
Sentiment Analysis on Twitter by Subarno Pal
Sentiment Analysis on TwitterSentiment Analysis on Twitter
Sentiment Analysis on Twitter
Subarno Pal817 views
Neo4j Training Cypher by Max De Marzi
Neo4j Training CypherNeo4j Training Cypher
Neo4j Training Cypher
Max De Marzi684 views
Foundry technical intro by esseemme69
Foundry technical introFoundry technical intro
Foundry technical intro
esseemme69483 views
Mongodb introduction and_internal(simple) by Kai Zhao
Mongodb introduction and_internal(simple)Mongodb introduction and_internal(simple)
Mongodb introduction and_internal(simple)
Kai Zhao1.3K views
Real-Time Entity Resolution with Elasticsearch - Haystack 2018 by zentity.io
Real-Time Entity Resolution with Elasticsearch - Haystack 2018Real-Time Entity Resolution with Elasticsearch - Haystack 2018
Real-Time Entity Resolution with Elasticsearch - Haystack 2018
zentity.io1.1K views
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff... by DataStax
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
DataStax5.4K views
Introducing Amazon Connect-Keynote-Enterprise Connect 2017 by Amazon Web Services
Introducing Amazon Connect-Keynote-Enterprise Connect 2017Introducing Amazon Connect-Keynote-Enterprise Connect 2017
Introducing Amazon Connect-Keynote-Enterprise Connect 2017
Amazon Web Services1.6K views
iOS Application Penetration Testing for Beginners by RyanISI
iOS Application Penetration Testing for BeginnersiOS Application Penetration Testing for Beginners
iOS Application Penetration Testing for Beginners
RyanISI8.4K views
Machine Learning with PyCarent + MLflow by Databricks
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflow
Databricks589 views
Azure data analytics platform - A reference architecture by Rajesh Kumar
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
Rajesh Kumar484 views
Elasticsearch for beginners by Neil Baker
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
Neil Baker15.5K views
Next CERN Accelerator Logging Service with Jakub Wozniak by Spark Summit
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit2.3K views

Similar to Real time entity resolution with elasticsearch - haystack 2018

In:Confidence 2019 - Balancing the conflicting objectives of data access and ... by
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...Privitar
297 views13 slides
Privacy solutions decode2021_jon_oliver by
Privacy solutions decode2021_jon_oliverPrivacy solutions decode2021_jon_oliver
Privacy solutions decode2021_jon_oliverJonathanOliver26
141 views35 slides
How We Did It: The Case of the Credit Card Breach by
How We Did It: The Case of the Credit Card BreachHow We Did It: The Case of the Credit Card Breach
How We Did It: The Case of the Credit Card BreachTeradata
2.4K views23 slides
Mastering Location Data – a new paradigm in network analytics by
Mastering Location Data – a new paradigm in network analyticsMastering Location Data – a new paradigm in network analytics
Mastering Location Data – a new paradigm in network analyticsPrecisely
134 views25 slides
Global AI Bootcamp Singapore - Keynote by
Global AI Bootcamp Singapore - KeynoteGlobal AI Bootcamp Singapore - Keynote
Global AI Bootcamp Singapore - KeynoteAlex Smith
258 views72 slides
The Domains of Identity & Self-Sovereign Identity MyData 2018 by
The Domains of Identity & Self-Sovereign Identity MyData 2018The Domains of Identity & Self-Sovereign Identity MyData 2018
The Domains of Identity & Self-Sovereign Identity MyData 2018Kaliya "Identity Woman" Young
1.9K views168 slides

Similar to Real time entity resolution with elasticsearch - haystack 2018(20)

In:Confidence 2019 - Balancing the conflicting objectives of data access and ... by Privitar
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
Privitar297 views
Privacy solutions decode2021_jon_oliver by JonathanOliver26
Privacy solutions decode2021_jon_oliverPrivacy solutions decode2021_jon_oliver
Privacy solutions decode2021_jon_oliver
JonathanOliver26141 views
How We Did It: The Case of the Credit Card Breach by Teradata
How We Did It: The Case of the Credit Card BreachHow We Did It: The Case of the Credit Card Breach
How We Did It: The Case of the Credit Card Breach
Teradata2.4K views
Mastering Location Data – a new paradigm in network analytics by Precisely
Mastering Location Data – a new paradigm in network analyticsMastering Location Data – a new paradigm in network analytics
Mastering Location Data – a new paradigm in network analytics
Precisely134 views
Global AI Bootcamp Singapore - Keynote by Alex Smith
Global AI Bootcamp Singapore - KeynoteGlobal AI Bootcamp Singapore - Keynote
Global AI Bootcamp Singapore - Keynote
Alex Smith258 views
All Clearances or Cyber Virtual Job Fair Handbook June 3, 2020, Huntsville by ClearedJobs.Net
All Clearances or Cyber Virtual Job Fair Handbook June 3, 2020, HuntsvilleAll Clearances or Cyber Virtual Job Fair Handbook June 3, 2020, Huntsville
All Clearances or Cyber Virtual Job Fair Handbook June 3, 2020, Huntsville
ClearedJobs.Net497 views
Trusting AI with important decisions by Louis Dorard
Trusting AI with important decisionsTrusting AI with important decisions
Trusting AI with important decisions
Louis Dorard1.2K views
Cybersecurity for Marketing by Alert Logic
Cybersecurity for Marketing Cybersecurity for Marketing
Cybersecurity for Marketing
Alert Logic 1.9K views
TECHNOLOGY: Solution to our woos not Politicians & INTERNET of THINGS in Nuts... by Ravi Chandra
TECHNOLOGY: Solution to our woos not Politicians & INTERNET of THINGS in Nuts...TECHNOLOGY: Solution to our woos not Politicians & INTERNET of THINGS in Nuts...
TECHNOLOGY: Solution to our woos not Politicians & INTERNET of THINGS in Nuts...
Ravi Chandra509 views
Huntsville All Clearances or Cyber Virtual Job Fair Handbook June 3 by ClearedJobs.Net
Huntsville All Clearances or Cyber Virtual Job Fair Handbook June 3Huntsville All Clearances or Cyber Virtual Job Fair Handbook June 3
Huntsville All Clearances or Cyber Virtual Job Fair Handbook June 3
ClearedJobs.Net359 views
Database Design Disasters by Richie Rump
Database Design DisastersDatabase Design Disasters
Database Design Disasters
Richie Rump698 views
What You Need to Know About Robotic Process Automation: How It Works & Real-W... by Captricity
What You Need to Know About Robotic Process Automation: How It Works & Real-W...What You Need to Know About Robotic Process Automation: How It Works & Real-W...
What You Need to Know About Robotic Process Automation: How It Works & Real-W...
Captricity5.8K views
What i learned at the infosecurity isaca north america expo and conference 2019 by Ulf Mattsson
What i learned at the infosecurity isaca north america expo and conference 2019What i learned at the infosecurity isaca north america expo and conference 2019
What i learned at the infosecurity isaca north america expo and conference 2019
Ulf Mattsson54 views
Internet of Things Primer by Stephen Bates
Internet of Things PrimerInternet of Things Primer
Internet of Things Primer
Stephen Bates1.2K views
Supporting IT by David Meares by Alex Cachia
Supporting IT by David MearesSupporting IT by David Meares
Supporting IT by David Meares
Alex Cachia67 views
Cleared Job Fair Job Seeker Handbook June 11, 2020 by ClearedJobs.Net
Cleared Job Fair Job Seeker Handbook June 11, 2020Cleared Job Fair Job Seeker Handbook June 11, 2020
Cleared Job Fair Job Seeker Handbook June 11, 2020
ClearedJobs.Net601 views

More from OpenSource Connections

Encores by
EncoresEncores
EncoresOpenSource Connections
2K views53 slides
Test driven relevancy by
Test driven relevancyTest driven relevancy
Test driven relevancyOpenSource Connections
272 views20 slides
How To Structure Your Search Team for Success by
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessOpenSource Connections
162 views25 slides
The right path to making search relevant - Taxonomy Bootcamp London 2019 by
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019OpenSource Connections
993 views56 slides
Payloads and OCR with Solr by
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with SolrOpenSource Connections
655 views22 slides
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull by
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullOpenSource Connections
498 views5 slides

More from OpenSource Connections(20)

The right path to making search relevant - Taxonomy Bootcamp London 2019 by OpenSource Connections
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull by OpenSource Connections
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison by OpenSource Connections
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ... by OpenSource Connections
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Real time entity resolution with elasticsearch - haystack 2018

  • 2. Disambiguation. Entity: single attributes in unstructured text ("Named Entity Recognition") vs. Entity: multiple attributes in structured data ("Entity Resolution"). Person (Field: Value): Name: Alice Jones; DOB: 1984-01-01; Street: 123 Main St; Credit Card: 4040 0000 2020 8080; Phone: 202-555-1234
  • 3. What is entity resolution?
  • 4. Examples. Health Care (Patient ID): We need to identify [...] and their medical [...] many hand-written [...] Mixing up records puts [...] at risk of injury or [...] Sales & Marketing (Customer Intel): We have reps managing many sources of info on leads and customers. Our view of the buyer is fragmented and that makes us less effective. We're losing pipeline. Security & Compliance (Fraud): We need to track a person or device that is hiding its tracks. Connecting the dots is a laborious process and we can't keep up with our incident backlog. Military, IC, Law (Surveillance): We need to track a person or device that is hiding its identity. Our timely success is critical to public safety and national security. Privacy Compliance (GDPR): We must find and manage all PII to respond to inquiries. Failure to comply risks fines of €20 million or 4% annual turnover. IT (MDM): MDM is a slow and bureaucratic process. We can solve our own data quality problems faster and better. And we still need query time entity resolution.
  • 5. Why is identity hard to track?
  • 6. 1. Identity is Vague: Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234. Icons by icons8
  • 7. 2. Identity Changes: Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Alison Jones-Smith / 555 Brooad Street / XYZ Tech / 3030 5500 9999 0000 / 2025559867; Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allison Smith / 555 Broad St / XYZ Technology Corp. / 3030 5050 9999 0000 / 202-555-9876.
  • 8. 3. Identity is Messy: Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Alison Jones-Smith / 555 Brooad Street / XYZ Tech / 3030 5500 9999 0000 / 2025559867; Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allison Smith / 555 Broad St / XYZ Technology Corp. / 3030 5050 9999 0000 / 202-555-9876.
  • 9. 4. Identity is Diverse: Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Alison Jones-Smith / 555 Brooad Street / XYZ Tech / 3030 5500 9999 0000 / 2025559867; Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allison Smith / 555 Broad St / XYZ Technology Corp. / 3030 5050 9999 0000 / 202-555-9876; ??? / ??? / ??? / ???.
  • 10. Entity Resolution connects the dots despite these challenges
  • 11. Comparison to Search. Search: name:"Allie Jones" AND street:"123 Main St". Search engine ranks results once. True hits mixed with noise. Resolution: name:"Allie Jones" AND street:"123 Main St". Search engine filters results recursively. True hits isolated and transitively linked. Records: Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allie Jones / 123 Main Street / ABC Widgets / 4040 0000 2020 8080 / 202.555.1234; Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Allie Jones / 132 W Main Street / ABC Widgets / 4040 0000 2020 8080 / 202 555 1234; Allie Smith / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allie Smith / 123 Main Street / ABC Widgets / 4040 0000 2020 8080 / 202.555.1234; Ali Smith / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Allie Smith / 555 Broad St / ABC Widgets, Inc / 4040 0000 2020 8080 / 202-555-1234; Allie Smith / 555 Broad Street / XYZ Tech Corp / 3030 5050 9999 0000 / 202.555.1234; Allie Smith / 555 Broad Street / XYZ Technology Corp / 3030 5050 9999 0000 / 202-555-9876. Records: Allie Jones / 123 Main St / ABC Widgets, Inc. / 4040 0000 2020 8080 / 202-555-1234; Allie Jones / 123 Main Street / ABC Widgets / 4040 0000 2020 8080 / 202.555.1234; Ali Jones / 123 W Main Street / ABC Wigdets / 4040 0000 2020 8008 / +1 (202) 555 1234; Ali Jones / 132 Mane Street / ABC Widgets / 4024 0071 4970 1227 / 888-555-5555; Aly Jonas / 113 Main Street / Acme Corp. / 4716 1035 4536 4671 / 610-555-5555; Allie Jones / 132 W Main Street / ABC Widgets / 4040 0000 2020 8080 / 202-555-9876; Al Jones / 132 E Main St / Mom & Pop, LLC / 3772 733741 52501 / 1-610-555-0000; Aly Jones / 113 Main St, #102 / Acme Corp. / 4716 1035 4536 4671 / 610-555-5555; Ali Jones / 132 Mane Street / ABC Widgets / 4024 0071 4970 1227 / 888-555-1234; Aly Jonas / 113 Main Street / Acme Corp. / 4781 9105 0533 4481 / 610-555-2345; Allie Johns / 132 W Main Street / ABC Widgets / 4088 0110 2044 8180 / 202-555-3456; Elle Jeon / 132 E Main St / Mom & Pop, LLC / 3502 730741 52203 / 1-610-555-4567; Elle Jones / 113 Main St, #102 / Acme Corp. / 4716 1035 4536 4671 / 610-555-5678; Eli Jones / 132 Mane Street / ABC Widgets / 4224 0065 4800 1337 / 888-555-6789; Eli Joans / 113 Main Street / Acme Corp. / 4206 1035 4536 4081 / 610-555-7890; Allie Jeans / 132 N Mean Street / ABC Widgets / 4240 0101 02020 8888 / 202-555-8901.
  • 13. Batch vs. Real-Time. Batch: How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components). How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents). When is it necessary? Population or network analysis. Real-Time: How is it used? Resolve one entity on query (recursive Boolean query). How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each). When is it necessary? Individual analysis. Most solutions have a real-time phase, sometimes applied after batch resolution.
  • 14. Real-Time: Why Elasticsearch. Robust matching: • Token normalization • Phonetic matching • Fuzzy transpositions • Boolean logic filtering • Fine-tune search parameters. Suited for operations: • Horizontal scaling • Real-time response rates • Flexible index mappings
  • 15. Approach: Goals • Fast – Get results in real-time. From milliseconds to low seconds. • Generic – Resolve any type of entity. People, companies, locations, sessions, etc. • Transitive – Resolve over multiple hops of matches. Capture changing identities. • Multi-source – Resolve over multiple indices with disparate mappings. • Accommodating – Operate on data as it exists. Avoid transforming and reindexing data. • Logical – Logic is easier to read, troubleshoot, and optimize than statistics. • 100% Elasticsearch – Operate within existing search infrastructure.
  • 16. Approach: Steps. 1. Entity modeling – What is the entity? What are its attributes? 2. Analyzers – How are you indexing each attribute? 3. Matchers – What is the query logic for each attribute? 4. Resolvers – What combinations of matching attributes imply a resolution? 5. Metadata maps – Which matchers apply to which indexed fields? 6. Recursive queries – How to repeat the queries until completion?
  • 17. zentity (zentity.io): An open source Elasticsearch plugin for real-time entity resolution
  • 18. zentity (zentity.io): An open source Elasticsearch plugin for real-time entity resolution. POST _zentity/resolution/person { "attributes": { "name": "Alice Jones", "dob": "1984-01-01", "phone": [ "555-123-4567", "555-987-6543" ] } }
  • 20. Demos. Customer intelligence: Gather everything we know about a customer. Web traffic sessionization: Track a bot that cycles through IP addresses, cookies, and user agent signatures. Fraud detection: Determine if a health care provider was blacklisted under a different name.
  • 25. Step 1. Entity Modeling. Name the entity type: Person. Define its attributes. Study them in your data sets. Attribute (Uniqueness / Consistency / Presence): Name – First Name (Moderate / Moderate / Extreme); Name – Last Name (Moderate / Moderate / High); Address – Street (High / Low / Moderate); Address – City (Low / Moderate / Moderate); Address – Province (Low / High / High); Address – Postal Code (Low / High / High); Address – Country (Low / High / High); Date of Birth (Moderate / High / Moderate); Phone Number (Moderate / Moderate / Moderate); Email Address (High / Extreme / Moderate); IP Address (High / Extreme / Low); Credit Card Number (Extreme / Extreme / Low); Social Security Number (Extreme / High / None). This model is independent from your indices. You can reuse and extend this model as you add or amend indices.
  • 27. Step 2. Analyzers. Person. Take the attributes: Name – First Name; Name – Last Name; Address – Street; Address – City; Address – Province; Address – Postal Code; Address – Country; Date of Birth; Phone Number; Email Address; IP Address; Credit Card Number; Social Security Number. Define their analyzers. Phonetic: "Alice Jones" => ["ALAC","JAN"]. Standard: "Alice Jones" => ["ALICE","JONES"]. Put them in your index mappings. Settings: { "settings": { "index": { "analysis": { "filter": { "phonetic": { "type": "phonetic", "encoder": "nysiis" } }, "analyzer": { "phonetic": { "filter": [ "icu_normalizer", "icu_folding", "phonetic" ], "tokenizer": "standard" } } } } } } Mappings: { "mappings": { "_doc": { "properties": { "first_name": { "type": "text", "fields": { "phonetic": { "type": "text", "analyzer": "phonetic" } } } } } } } Analyzers are powerful. But they must be defined prior to indexing. Give careful thought to your analyzers to avoid having to reindex data.
  • 29. Step 3. Matchers. Person. Take the attributes: Name – First Name; Name – Last Name; Address – Street; Address – City; Address – Province; Address – Postal Code; Address – Country; Date of Birth; Phone Number; Email Address; IP Address; Credit Card Number; Social Security Number. Define their Boolean query logic. Use templates for variables. Phonetic: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 0 } } } Standard: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 2 } } } {{ field }} – The field of an index. {{ value }} – The value of an attribute. We will replace these at query time. Understand that each matcher will be combined into one large Boolean query.
  • 31. Step 4. Resolvers. Person. Take the attributes: Name – First Name; Name – Last Name; Address – Street; Address – City; Address – Province; Address – Postal Code; Address – Country; Date of Birth; Phone Number; Email Address; IP Address; Credit Card Number; Social Security Number. Determine which combinations of matching attributes imply a resolution: [ Name – First, Name – Last, Address – Street, Address – City, Address – State ] [ Name – First, Name – Last, Address – Street, Address – Postal Code ] [ Name – First, Name – Last, Date of Birth, Address – City, Address – State ] [ Name – First, Name – Last, Date of Birth, Address – Postal Code ] [ Name – First, Name – Last, Phone Number ] [ Name – First, Name – Last, Email Address ] [ Name – First, Name – Last, IP Address ] [ Name – First, Name – Last, Credit Card Number ] [ Name – First, Name – Last, Social Security Number ] [ Email Address, Phone Number ] [ Email Address, IP Address ] [ Email Address, Credit Card Number ] [ IP Address, Credit Card Number ] Avoid resolving on a single attribute such as Social Security Number. Corroboration among multiple attributes helps prevent snowballs.
  • 32. Step 5. Metadata Maps. Person. Take the attributes: Name – First Name; Name – Last Name; Address – Street; Address – City; Address – Province; Address – Postal Code; Address – Country; Date of Birth; Phone Number; Email Address; IP Address; Credit Card Number; Social Security Number. Map them to the fields of the relevant indices: users.first_name; users.last_name; users.phone; users.email; customers:fname; customers:lname; customers:tel; customers:email; customers:cc; customers:zip.
  • 33. Step 6. Recursive Queries. With each query, new inputs might be found in different attributes. Use the metadata map and your resolvers to determine if you can create new queries for the new inputs.
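
Steps 4 through 6 can be sketched as a small in-memory loop. This is a toy illustration and not zentity's implementation: the sample records, the names `INDICES`, `FIELD_MAP`, `RESOLVERS`, `normalize`, and `resolve` are all hypothetical, and exact equality after a crude normalization stands in for the Elasticsearch analyzers and matchers from Steps 2 and 3. It only shows how resolvers, a metadata map, and repeated hops fit together.

```python
import re

# Hypothetical sample data: two "indices" with different field names.
INDICES = {
    "users": [
        {"_id": "u1", "first_name": "Allie", "last_name": "Jones",
         "phone": "202-555-1234", "email": "allie@example.com"},
        {"_id": "u2", "first_name": "Allie", "last_name": "Smith",
         "phone": "202.555.1234", "email": "allie@example.com"},
        {"_id": "u3", "first_name": "Eli", "last_name": "Jones",
         "phone": "888-555-6789", "email": "eli@example.com"},
    ],
    "customers": [
        {"_id": "c1", "fname": "Allie", "lname": "Smith",
         "tel": "202-555-9876", "email": "allie@example.com"},
    ],
}

# Step 5 (metadata map): entity attribute -> field name, per index.
FIELD_MAP = {
    "users": {"first_name": "first_name", "last_name": "last_name",
              "phone": "phone", "email": "email"},
    "customers": {"first_name": "fname", "last_name": "lname",
                  "phone": "tel", "email": "email"},
}

# Step 4 (resolvers): attribute combinations that imply a resolution.
RESOLVERS = [
    ("first_name", "last_name", "phone"),
    ("first_name", "last_name", "email"),
    ("email", "phone"),
]


def normalize(attribute, value):
    """Crude stand-in for analyzers and matchers (an assumption of this sketch)."""
    if attribute == "phone":
        return re.sub(r"\D", "", value)  # keep digits only
    return value.strip().lower()


def doc_attributes(index, doc):
    """Translate a document's fields into normalized entity attributes."""
    return {attr: normalize(attr, doc[field])
            for attr, field in FIELD_MAP[index].items() if field in doc}


def resolve(inputs, max_hops=10):
    """Step 6: repeat the query until a hop finds no new documents."""
    known = {attr: {normalize(attr, v) for v in values}
             for attr, values in inputs.items()}
    hits, seen = [], set()
    for hop in range(1, max_hops + 1):
        new_docs = []
        for index, docs in INDICES.items():
            for doc in docs:
                key = (index, doc["_id"])
                if key in seen:
                    continue
                attrs = doc_attributes(index, doc)
                # A document matches if every attribute of at least one
                # resolver equals a value that is already known.
                if any(all(attrs.get(a) in known.get(a, ()) for a in resolver)
                       for resolver in RESOLVERS):
                    new_docs.append((key, attrs))
        if not new_docs:
            break
        # Values found in this hop become inputs for the next hop.
        for key, attrs in new_docs:
            seen.add(key)
            hits.append((hop, *key))
            for attr, value in attrs.items():
                known.setdefault(attr, set()).add(value)
    return hits


if __name__ == "__main__":
    query = {"first_name": ["Allie"], "last_name": ["Jones"],
             "phone": ["202-555-1234"]}
    for hop, index, doc_id in resolve(query):
        print(hop, index, doc_id)
```

With this sample data, the query for Allie Jones matches `u1` on the first hop, reaches `u2` on the second hop through the shared email and phone, and reaches `c1` on the third hop once the last name Smith is known. The unrelated `u3` is never pulled in, because no resolver is satisfied by a single matching attribute.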