2. Disambiguation
Entity: single attributes in unstructured text ("Named Entity Recognition")
vs.
Entity: multiple attributes in structured data ("Entity Resolution")
Person
Field        Value
Name         Alice Jones
DOB          1984-01-01
Street       123 Main St
Credit Card  4040 0000 2020 8080
Phone        202-555-1234
4. Examples
Health Care: Patient ID
We need to identify [...] and their medical [...] many hand-written [...] Mixing up records puts [...] at risk of injury or [...]
Sales & Marketing: Customer Intel
We have reps managing many sources of info on leads and customers. Our view of the buyer is fragmented and that makes us less effective. We're losing pipeline.
Security & Compliance: Fraud
We need to track a person or device that is hiding its tracks. Connecting the dots is a laborious process and we can't keep up with our incident backlog.
Military, IC, Law: Surveillance
We need to track a person or device that is hiding its identity. Our timely success is critical to public safety and national security.
Privacy Compliance: GDPR
We must find and manage all PII to respond to inquiries. Failure to comply risks fines of €20 million or 4% of annual turnover.
IT: MDM
MDM is a slow and bureaucratic process. We can solve our own data quality problems faster and better. And we still need query-time entity resolution.
6. 1. Identity is Vague
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Icons by icons8
7. 2. Identity Changes
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allison Smith 555 Broad St XYZ Technology Corp. 3030 5050 9999 0000 202-555-9876
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Alison Jones-Smith 555 Brooad Street XYZ Tech 3030 5500 9999 0000 2025559867
8. 3. Identity is Messy
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allison Smith 555 Broad St XYZ Technology Corp. 3030 5050 9999 0000 202-555-9876
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Alison Jones-Smith 555 Brooad Street XYZ Tech 3030 5500 9999 0000 2025559867
9. 4. Identity is Diverse
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allison Smith 555 Broad St XYZ Technology Corp. 3030 5050 9999 0000 202-555-9876
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Alison Jones-Smith 555 Brooad Street XYZ Tech 3030 5500 9999 0000 2025559867
???
???
???
???
11. Comparison to Search
Search
Query: name:"Allie Jones" AND street:"123 Main St"
Results:
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-5555
Aly Jonas 113 Main Street Acme Corp. 4716 1035 4536 4671 610-555-5555
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202-555-9876
Al Jones 132 E Main St Mom & Pop, LLC 3772 733741 52501 1-610-555-0000
Aly Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5555
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-1234
Aly Jonas 113 Main Street Acme Corp. 4781 9105 0533 4481 610-555-2345
Allie Johns 132 W Main Street ABC Widgets 4088 0110 2044 8180 202-555-3456
Elle Jeon 132 E Main St Mom & Pop, LLC 3502 730741 52203 1-610-555-4567
Elle Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5678
Eli Jones 132 Mane Street ABC Widgets 4224 0065 4800 1337 888-555-6789
Eli Joans 113 Main Street Acme Corp. 4206 1035 4536 4081 610-555-7890
Allie Jeans 132 N Mean Street ABC Widgets 4240 0101 02020 8888 202-555-8901
Search engine ranks results once.
True hits mixed with noise.
Resolution
Query: name:"Allie Jones" AND street:"123 Main St"
Results:
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202 555 1234
Allie Smith 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Smith 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Smith 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Smith 555 Broad St ABC Widgets, Inc 4040 0000 2020 8080 202-555-1234
Allie Smith 555 Broad Street XYZ Tech Corp 3030 5050 9999 0000 202.555.1234
Allie Smith 555 Broad Street XYZ Technology Corp 3030 5050 9999 0000 202-555-9876
Search engine filters results recursively.
True hits isolated and transitively linked.
13. Batch vs. Real-Time
Batch
How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components).
How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents).
When is it necessary? Population or network analysis.
Real-Time
How is it used? Resolve one entity on query (recursive Boolean query).
How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each).
When is it necessary? Individual analysis.
Most solutions have a real-time phase, sometimes applied after batch resolution.
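The three batch phases named on this slide (partitioning, pairwise scoring, connected components) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records reuse names, streets, and phone numbers from the earlier slides, while the partition key (phone area code), the scoring rule (count of equal attributes), and the threshold of 2 are hypothetical choices, not the presenter's method.

```python
import re
from collections import defaultdict
from itertools import combinations

# Sample records: (first name, last name, street, phone).
RECORDS = {
    1: ("Allie", "Jones", "123 Main St", "202-555-1234"),
    2: ("Ali", "Jones", "123 W Main Street", "+1 (202) 555 1234"),
    3: ("Allie", "Smith", "123 Main St", "202.555.1234"),
    4: ("Allie", "Smith", "555 Broad Street", "202-555-9876"),
    5: ("Elle", "Jeon", "132 E Main St", "1-610-555-4567"),
}

def normalize(record):
    first, last, street, phone = record
    digits = re.sub(r"\D", "", phone)[-10:]  # keep the last 10 digits
    return (first.lower(), last.lower(), street.lower(), digits)

def partition_key(norm):
    return norm[3][:3]  # hypothetical blocking key: phone area code

def score(a, b):
    return sum(x == y for x, y in zip(a, b))  # number of equal attributes

def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def resolve_batch(records, threshold=2):
    norm = {rid: normalize(rec) for rid, rec in records.items()}
    # Phase 1: partitioning, so only records in the same block are compared.
    partitions = defaultdict(list)
    for rid, n in norm.items():
        partitions[partition_key(n)].append(rid)
    # Phase 2: pairwise scoring inside each partition.
    parent = {rid: rid for rid in records}
    for ids in partitions.values():
        for a, b in combinations(ids, 2):
            if score(norm[a], norm[b]) >= threshold:
                parent[find(parent, a)] = find(parent, b)
    # Phase 3: connected components are the resolved entities.
    components = defaultdict(set)
    for rid in records:
        components[find(parent, rid)].add(rid)
    return sorted(sorted(c) for c in components.values())

print(resolve_batch(RECORDS))
```

Records 2 and 4 never match each other directly; they end up in the same entity only through the chain of pairwise matches, which is the transitive linking that the connected-components phase provides.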
15. Approach: Goals
• Fast – Get results in real time, from milliseconds to low seconds.
• Generic – Resolve any type of entity. People, companies, locations, sessions, etc.
• Transitive – Resolve over multiple hops of matches. Capture changing identities.
• Multi-source – Resolve over multiple indices with disparate mappings.
• Accommodating – Operate on data as it exists. Avoid transforming and reindexing data.
• Logical – Logic is easier to read, troubleshoot, and optimize than statistics.
• 100% Elasticsearch – Operate within existing search infrastructure.
16. Approach: Steps
1. Entity modeling – What is the entity? What are its attributes?
2. Analyzers – How are you indexing each attribute?
3. Matchers – What is the query logic for each attribute?
4. Resolvers – What combinations of matching attributes imply a resolution?
5. Metadata maps – Which matchers apply to which indexed fields?
6. Recursive queries – How to repeat the queries until completion?
20. Demos
Customer intelligence
Gather everything we know about a customer.
Web traffic sessionization
Track a bot that cycles through IP addresses, cookies, and user agent signatures.
Fraud detection
Determine if a health care provider was blacklisted under a different name.
24. Step 1. Entity Modeling
Name the entity type.
Person
Define its attributes. Study them in your data sets.
Attribute               Uniqueness  Consistency  Presence
Name – First Name       Moderate    Moderate     Extreme
Name – Last Name        Moderate    Moderate     High
Address – Street        High        Low          Moderate
Address – City          Low         Moderate     Moderate
Address – Province      Low         High         High
Address – Postal Code   Low         High         High
Address – Country       Low         High         High
Date of Birth           Moderate    High         Moderate
Phone Number            Moderate    Moderate     Moderate
Email Address           High        Extreme      Moderate
IP Address              High        Extreme      Low
Credit Card Number      Extreme     Extreme      Low
Social Security Number  Extreme     High         None
This model is independent from your indices.
You can reuse and extend this model as you add or amend indices.
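One way to "study the attributes in your data sets" is to measure presence and uniqueness directly on a sample. The sketch below is an assumption-laden illustration: the `profile` helper, the sample records, and the `example.com` email addresses are hypothetical, and consistency is left out because judging it requires inspecting formats rather than counting values.

```python
def profile(records, attributes):
    """Return {attribute: (presence, uniqueness)} as fractions between 0 and 1.

    presence   = share of records that carry a value for the attribute
    uniqueness = share of distinct values among the values that are present
    """
    stats = {}
    for attr in attributes:
        values = [r[attr] for r in records if r.get(attr) not in (None, "")]
        presence = len(values) / len(records)
        uniqueness = len(set(values)) / len(values) if values else 0.0
        stats[attr] = (presence, uniqueness)
    return stats

# Hypothetical sample; only the first names come from the slides.
SAMPLE = [
    {"first_name": "Allie", "city": "Washington", "email": "allie@example.com"},
    {"first_name": "Allie", "city": "Washington", "email": "asmith@example.com"},
    {"first_name": "Elle", "city": "Washington", "email": None},
    {"first_name": "Eli", "city": "Washington"},
]

for attr, (presence, uniqueness) in profile(SAMPLE, ["first_name", "city", "email"]).items():
    print(f"{attr}: presence={presence:.2f} uniqueness={uniqueness:.2f}")
```

In this toy sample the city is always present but barely unique, while the email is highly unique but often missing, which is the kind of trade-off the ratings above summarize.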
26. Step 2. Analyzers
Take the attributes. Define their analyzers. Put them in your index mappings.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Phonetic
"Alice Jones" => ["ALAC","JAN"]
Standard
"Alice Jones" => ["ALICE","JONES"]
Index settings:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "phonetic": {
            "type": "phonetic",
            "encoder": "nysiis"
          }
        },
        "analyzer": {
          "phonetic": {
            "filter": [
              "icu_normalizer",
              "icu_folding",
              "phonetic"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}
Index mappings:
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "phonetic"
            }
          }
        }
      }
    }
  }
}
Analyzers are powerful. But they must be defined prior to indexing.
Give careful thought to your analyzers to avoid having to reindex data.
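Because analyzers have to exist before any document is indexed, it can help to generate the whole index body up front. The following is a minimal sketch, assuming a hypothetical `build_index_body` helper. It combines the settings and mappings shown on the slide into one body. Two facts worth keeping in mind: the `phonetic` token filter comes from the analysis-phonetic plugin and the `icu_*` filters from the analysis-icu plugin, and Elasticsearch 7.x and later expect mappings without the `_doc` type level, which is why the sketch makes that level optional.

```python
import json

def build_index_body(text_fields, include_type=False):
    """Build a create-index body with a phonetic sub-field for each text field."""
    settings = {
        "index": {
            "analysis": {
                "filter": {
                    "phonetic": {"type": "phonetic", "encoder": "nysiis"}
                },
                "analyzer": {
                    "phonetic": {
                        "tokenizer": "standard",
                        "filter": ["icu_normalizer", "icu_folding", "phonetic"],
                    }
                },
            }
        }
    }
    properties = {
        name: {
            "type": "text",  # default (standard) analyzer on the main field
            "fields": {
                "phonetic": {"type": "text", "analyzer": "phonetic"}
            },
        }
        for name in text_fields
    }
    # Older clusters (6.x) expect the "_doc" type level; newer ones do not.
    mappings = {"_doc": {"properties": properties}} if include_type else {"properties": properties}
    return {"settings": settings, "mappings": mappings}

body = build_index_body(["first_name", "last_name"])
print(json.dumps(body, indent=2))
```

With this layout, a query against `first_name` uses the standard analysis and a query against `first_name.phonetic` uses the phonetic one, so both matchers can run against the same indexed value.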
28. Step 3. Matchers
Take the attributes. Define their Boolean query logic. Use templates for variables.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Phonetic
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 0
    }
  }
}
Standard
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 2
    }
  }
}
{{ field }} – The field of an index.
{{ value }} – The value of an attribute.
We will replace these at query time.
Understand that each matcher will be combined
into one large Boolean query.
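Replacing the two template variables at query time can be sketched as plain string substitution followed by JSON parsing. The `MATCHERS` dictionary and the `render` function below are hypothetical names for illustration; the templates themselves are the two matchers from the slide. Values are JSON-escaped before substitution so that a quote inside a name cannot break the query.

```python
import json

# The two matcher templates from the slide, stored as JSON strings.
MATCHERS = {
    "phonetic": '{"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 0}}}',
    "standard": '{"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 2}}}',
}

def escape(text):
    # json.dumps adds surrounding quotes; strip them and keep the escaped body.
    return json.dumps(text)[1:-1]

def render(matcher, field, value):
    """Fill in {{ field }} and {{ value }} and return the clause as a dict."""
    filled = (
        MATCHERS[matcher]
        .replace("{{ field }}", escape(field))
        .replace("{{ value }}", escape(value))
    )
    return json.loads(filled)

print(render("standard", "first_name", "Allie"))
print(render("phonetic", "first_name.phonetic", 'Al "AJ" Jones'))
```

Each rendered clause is an ordinary `match` query, so many of them can be nested inside a single `bool` query.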
30. Step 4. Resolvers
Take the attributes. Determine which combinations of matching attributes imply a resolution.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Resolvers
[ Name – First, Name – Last, Address – Street, Address – City, Address – Province ]
[ Name – First, Name – Last, Address – Street, Address – Postal Code ]
[ Name – First, Name – Last, Date of Birth, Address – City, Address – Province ]
[ Name – First, Name – Last, Date of Birth, Address – Postal Code ]
[ Name – First, Name – Last, Phone Number ]
[ Name – First, Name – Last, Email Address ]
[ Name – First, Name – Last, IP Address ]
[ Name – First, Name – Last, Credit Card Number ]
[ Name – First, Name – Last, Social Security Number ]
[ Email Address, Phone Number ]
[ Email Address, IP Address ]
[ Email Address, Credit Card Number ]
[ IP Address, Credit Card Number ]
Avoid resolving on a single attribute such as Social Security Number.
Corroboration among multiple attributes helps prevent snowballs.
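A resolver can be read as "all of these attributes must match", and the list of resolvers as "any one of these combinations is enough". A minimal sketch of that logic as an Elasticsearch `bool` query follows. The `build_query` and `match_clause` names are hypothetical, only three of the resolvers are shown, and attribute names double as field names to keep the example short.

```python
RESOLVERS = [
    ["first_name", "last_name", "phone"],
    ["first_name", "last_name", "email"],
    ["email", "phone"],
]

def match_clause(field, value):
    return {"match": {field: {"query": value}}}

def build_query(known, resolvers):
    """Combine resolvers into one bool query.

    known: {attribute: value} for the attributes we have values for.
    Returns None when no resolver is fully covered by the known attributes.
    """
    should = []
    for resolver in resolvers:
        # A resolver is usable only if every one of its attributes has a value.
        if all(attr in known for attr in resolver):
            clauses = [match_clause(attr, known[attr]) for attr in resolver]
            should.append({"bool": {"filter": clauses}})  # AND within a resolver
    if not should:
        return None
    # OR across resolvers: at least one full combination must match.
    return {"bool": {"should": should, "minimum_should_match": 1}}

query = build_query(
    {"first_name": "Allie", "last_name": "Jones", "phone": "202-555-1234"},
    RESOLVERS,
)
print(query)
print(build_query({"phone": "202-555-1234"}, RESOLVERS))
```

A lone phone number produces no query at all here, which reflects the advice above: a single attribute is not allowed to resolve an entity by itself.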
32. Step 5. Metadata Maps
Take the attributes. Map them to the fields of the relevant indices.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
users.first_name
users.last_name
users.phone
users.email
customers.fname
customers.lname
customers.tel
customers.email
customers.cc
customers.zip
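A metadata map can be as simple as a nested dictionary from index name to attribute name to field name. In the sketch below, the index and field names come from the slide, but the pairing of attributes with fields is inferred from the field names and should be treated as an assumption, as should the helper names `to_fields` and `to_attributes`.

```python
# index -> {attribute: field}; pairings inferred from the field names.
FIELD_MAP = {
    "users": {
        "first_name": "first_name",
        "last_name": "last_name",
        "phone": "phone",
        "email": "email",
    },
    "customers": {
        "first_name": "fname",
        "last_name": "lname",
        "phone": "tel",
        "email": "email",
        "credit_card": "cc",
        "postal_code": "zip",
    },
}

def to_fields(index, known):
    """Translate {attribute: value} into {field: value} for one index."""
    mapping = FIELD_MAP[index]
    return {mapping[attr]: value for attr, value in known.items() if attr in mapping}

def to_attributes(index, doc):
    """Translate a document from one index back into {attribute: value}."""
    mapping = FIELD_MAP[index]
    return {attr: doc[field] for attr, field in mapping.items() if field in doc}

known = {"phone": "202-555-1234", "credit_card": "4040 0000 2020 8080"}
print(to_fields("customers", known))  # both attributes exist in this index
print(to_fields("users", known))      # users has no credit card field
print(to_attributes("customers", {"fname": "Allie", "zip": "20001"}))
```

Keeping the translation in both directions matters for the next step: values found in one index must be turned back into attributes before they can be used to query another index.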
33. Step 6. Recursive Queries
With each query, new inputs might be found in different attributes. Use the metadata map and your resolvers to determine if you can create new queries for the new inputs.
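The loop described here can be sketched without a cluster by letting an in-memory dictionary stand in for the index and a plain function stand in for the Boolean query. Everything in the sketch is illustrative: the documents, the `example.com` email addresses, and the `search` and `resolve` functions are hypothetical, and a real implementation would send one query per hop to Elasticsearch instead.

```python
# Hypothetical documents; a stand-in for one Elasticsearch index.
DOCS = {
    "a": {"first_name": "Allie", "last_name": "Jones", "phone": "202-555-1234", "email": "allie@example.com"},
    "b": {"first_name": "Allie", "last_name": "Smith", "phone": "202-555-1234", "email": "allie@example.com"},
    "c": {"first_name": "Allie", "last_name": "Smith", "phone": "202-555-9876", "email": "allie@example.com"},
    "d": {"first_name": "Elle", "last_name": "Jeon", "phone": "610-555-4567", "email": "elle@example.com"},
}

RESOLVERS = [
    ("first_name", "last_name", "phone"),
    ("first_name", "last_name", "email"),
    ("email", "phone"),
]

def search(known, exclude):
    """Stand-in for one Boolean query: a document is a hit if, for at least
    one resolver, every attribute value is among the known values."""
    hits = []
    for doc_id, doc in DOCS.items():
        if doc_id in exclude:
            continue
        for resolver in RESOLVERS:
            if all(doc.get(attr) in known.get(attr, ()) for attr in resolver):
                hits.append(doc_id)
                break
    return hits

def resolve(seed, max_hops=10):
    """Repeat the query until a hop returns no new documents."""
    known = {attr: {value} for attr, value in seed.items()}
    found = {}  # document id -> hop on which it was found
    for hop in range(1, max_hops + 1):
        hits = search(known, exclude=found)
        if not hits:
            break
        for doc_id in hits:
            found[doc_id] = hop
            # Every value of a matched document becomes an input for the next hop.
            for attr, value in DOCS[doc_id].items():
                if value is not None:
                    known.setdefault(attr, set()).add(value)
    return found

result = resolve({"first_name": "Allie", "last_name": "Jones", "phone": "202-555-1234"})
print(result)
```

In this toy data the first hop finds only document `a`, its email then unlocks `b` through the email and phone resolver, and the new last name from `b` unlocks `c`, while `d` is never reached. The `max_hops` cap is a safety limit added for the sketch, since known values accumulate with every hop and a loose resolver could otherwise keep pulling in unrelated documents.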