Real-Time Entity Resolution with Elasticsearch
Dave Moore
david.moore@elastic.co
Haystack: The Search Relevancy Conference (11 April 2018)
1
What is entity resolution?
Disambiguation
Entity: Single attributes in unstructured text ("Named Entity Recognition")
vs.
Entity: Multiple attributes in structured data ("Entity Resolution")
Person
Field        Value
Name         Alice Jones
DOB          1984-01-01
Street       123 Main St
Credit Card  4040 0000 2020 8080
Phone        202-555-1234
2
Examples
Health Care – Patient ID
We need to identify [...] and their medical [...] many hand-written [...]. Mixing up records puts [...] at risk of injury or [...].
Sales & Marketing – Customer Intel
We have reps managing many sources of info on leads and customers. Our view of the buyer is fragmented and that makes us less effective. We're losing pipeline.
Security & Compliance – Fraud
We need to track a person or device that is hiding its tracks. Connecting the dots is a laborious process and we can't keep up with our incident backlog.
Military, IC, Law – Surveillance
We need to track a person or device that is hiding its identity. Our timely success is critical to public safety and national security.
Privacy Compliance – GDPR
We must find and manage all PII to respond to inquiries. Failure to comply risks fines of up to €20 million or 4% of annual turnover.
IT – MDM
MDM is a slow and bureaucratic process. We can solve our own data quality problems faster and better. And we still need query-time entity resolution.
3
4
Why is identity hard to track?
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
5
1. Identity is Vague
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Icons by icons8
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
6
2. Identity Changes
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
7
3. Identity is Messy
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
8
4. Identity is Diverse
Ali Jones
123 W Main Street
ABC Wigdets
4040 0000 2020 8008
+1 (202) 555 1234
Alison Jones-Smith
555 Brooad Street
XYZ Tech
3030 5500 9999 0000
2025559867
Allie Jones
123 Main St
ABC Widgets, Inc.
4040 0000 2020 8080
202-555-1234
Allison Smith
555 Broad St
XYZ Technology Corp.
3030 5050 9999 0000
202-555-9876
???
???
???
???
9
Entity Resolution
connects the dots despite these challenges
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202 555 1234
Allie Smith 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Smith 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Smith 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Allie Smith 555 Broad St ABC Widgets, Inc 4040 0000 2020 8080 202-555-1234
Allie Smith 555 Broad Street XYZ Tech Corp 3030 5050 9999 0000 202.555.1234
Allie Smith 555 Broad Street XYZ Technology Corp 3030 5050 9999 0000 202-555-9876
10
Comparison to Search
Search: name:"Allie Jones" AND street:"123 Main St"
Resolution: name:"Allie Jones" AND street:"123 Main St"
Allie Jones 123 Main St ABC Widgets, Inc. 4040 0000 2020 8080 202-555-1234
Allie Jones 123 Main Street ABC Widgets 4040 0000 2020 8080 202.555.1234
Ali Jones 123 W Main Street ABC Wigdets 4040 0000 2020 8008 +1 (202) 555 1234
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-5555
Aly Jonas 113 Main Street Acme Corp. 4716 1035 4536 4671 610-555-5555
Allie Jones 132 W Main Street ABC Widgets 4040 0000 2020 8080 202-555-9876
Al Jones 132 E Main St Mom & Pop, LLC 3772 733741 52501 1-610-555-0000
Aly Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5555
Ali Jones 132 Mane Street ABC Widgets 4024 0071 4970 1227 888-555-1234
Aly Jonas 113 Main Street Acme Corp. 4781 9105 0533 4481 610-555-2345
Allie Johns 132 W Main Street ABC Widgets 4088 0110 2044 8180 202-555-3456
Elle Jeon 132 E Main St Mom & Pop, LLC 3502 730741 52203 1-610-555-4567
Elle Jones 113 Main St, #102 Acme Corp. 4716 1035 4536 4671 610-555-5678
Eli Jones 132 Mane Street ABC Widgets 4224 0065 4800 1337 888-555-6789
Eli Joans 113 Main Street Acme Corp. 4206 1035 4536 4081 610-555-7890
Allie Jeans 132 N Mean Street ABC Widgets 4240 0101 02020 8888 202-555-8901
Search: Search engine ranks results once. True hits mixed with noise.
Resolution: Search engine filters results recursively. True hits isolated and transitively linked.
11
Real-Time
12
Batch vs. Real-Time
Batch
How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components)
How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents)
When is it necessary? Population or network analysis
Most solutions have a real-time phase, sometimes applied after batch resolution.
Real-Time
How is it used? Resolve one entity on query (recursive Boolean query)
How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each)
When is it necessary? Individual analysis
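To make the two cost expressions concrete, here is a minimal Python sketch that plugs in illustrative numbers. The workload figures are assumptions chosen for the example, not measurements from the talk, and the results are abstract operation counts rather than durations.

```python
def batch_cost(docs, partitions, components):
    """Docs + (Docs/Partitions)^2 + Components^2, as written on the slide."""
    return docs + (docs / partitions) ** 2 + components ** 2


def real_time_cost(indices, attributes, hops):
    """Indices * Attributes * Hops, as written on the slide."""
    return indices * attributes * hops


# Hypothetical workload: one billion documents, ten thousand partitions,
# one million connected components.
batch = batch_cost(docs=1_000_000_000, partitions=10_000, components=1_000_000)

# Hypothetical query: three indices, five attributes, four hops.
real_time = real_time_cost(indices=3, attributes=5, hops=4)

print(f"batch: {batch:,.0f} operations")
print(f"real-time: {real_time} queries")
```

Under these assumed inputs the squared terms dominate the batch expression, while the real-time expression stays small as long as the number of indices, attributes, and hops stays small.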
13
Real-Time
Why Elasticsearch
Robust matching
• Token normalization
• Phonetic matching
• Fuzzy transpositions
• Boolean logic filtering
• Fine-tune search parameters
Suited for operations
• Horizontal scaling
• Real-time response rates
• Flexible index mappings
14
Approach
Goals
• Fast – Get results in real-time. From milliseconds to low seconds.
• Generic – Resolve any type of entity. People, companies, locations, sessions, etc.
• Transitive – Resolve over multiple hops of matches. Capture changing identities.
• Multi-source – Resolve over multiple indices with disparate mappings.
• Accommodating – Operate on data as it exists. Avoid transforming and reindexing data.
• Logical – Logic is easier to read, troubleshoot, and optimize than statistics.
• 100% Elasticsearch – Operate within existing search infrastructure.
15
Approach
Steps
1. Entity modeling – What is the entity? What are its attributes?
2. Analyzers – How are you indexing each attribute?
3. Matchers – What is the query logic for each attribute?
4. Resolvers – What combinations of matching attributes imply a resolution?
5. Metadata maps – Which matchers apply to which indexed fields?
6. Recursive queries – How to repeat the queries until completion?
16
zentity.io
An open source Elasticsearch plugin for real-time entity resolution
17
POST _zentity/resolution/person
{
  "attributes": {
    "name": "Alice Jones",
    "dob": "1984-01-01",
    "phone": [ "555-123-4567", "555-987-6543" ]
  }
}
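As a rough illustration, the request above could be prepared from Python using only the standard library. The base URL is an assumption (a local cluster with the plugin installed), the helper name is hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

# Assumption: Elasticsearch with the zentity plugin is reachable here.
BASE_URL = "http://localhost:9200"


def build_resolution_request(entity_type, attributes):
    """Build (but do not send) the POST request shown on the slide."""
    body = json.dumps({"attributes": attributes}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/_zentity/resolution/{entity_type}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_resolution_request(
    "person",
    {
        "name": "Alice Jones",
        "dob": "1984-01-01",
        "phone": ["555-123-4567", "555-987-6543"],
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it, which requires a running cluster.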
zentity.io
18
Demos
19
Demos
Customer intelligence
Gather everything we know about a customer.
Web traffic sessionization
Track a bot that cycles through IP addresses, cookies, and user agent signatures.
Fraud detection
Determine if a health care provider was blacklisted under a different name.
Contact
Dave Moore
email: david.moore@elastic.co
zentity: zentity.io
@elastic
www.elastic.co
Extra Content
22
Approach
23
Step 1. Entity Modeling
Name the entity type.
Person
Define its attributes. Study them in your data sets.
Attribute                 Uniqueness  Consistency  Presence
Name – First Name         Moderate    Moderate     Extreme
Name – Last Name          Moderate    Moderate     High
Address – Street          High        Low          Moderate
Address – City            Low         Moderate     Moderate
Address – Province        Low         High         High
Address – Postal Code     Low         High         High
Address – Country         Low         High         High
Date of Birth             Moderate    High         Moderate
Phone Number              Moderate    Moderate     Moderate
Email Address             High        Extreme      Moderate
IP Address                High        Extreme      Low
Credit Card Number        Extreme     Extreme      Low
Social Security Number    Extreme     High         None
This model is independent from your indices.
You can reuse and extend this model as you add or amend indices.
25
Step 2. Analyzers
Take the attributes. Define their analyzers. Put them in your index mappings.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Phonetic
"Alice Jones" => ["ALAC","JAN"]
Standard
"Alice Jones" => ["ALICE","JONES"]
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "phonetic": {
            "type": "phonetic",
            "encoder": "nysiis"
          }
        },
        "analyzer": {
          "phonetic": {
            "filter": [
              "icu_normalizer",
              "icu_folding",
              "phonetic"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "phonetic"
            }
          }
        }
      }
    }
  }
}
Analyzers are powerful. But they must be defined prior to indexing.
Give careful thought to your analyzers to avoid having to reindex data.
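To show why a phonetic sub-field helps, here is a small Python sketch that contrasts plain tokens with phonetic codes. It is only an illustration: it swaps in the simpler Soundex algorithm for the NYSIIS encoder configured above, so the codes differ from the "ALAC" and "JAN" tokens on the slide, and the function names are hypothetical.

```python
import re

# Soundex digit for each consonant; vowels, h, w and y carry no digit.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word):
    """American Soundex: first letter plus up to three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            digits.append(code)
        if c not in "hw":  # h and w do not separate repeated codes
            previous = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def standard_tokens(text):
    """Rough stand-in for a standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def phonetic_tokens(text):
    """Rough stand-in for a phonetic analyzer: one code per token."""
    return [soundex(token) for token in standard_tokens(text)]


print(standard_tokens("Alice Jones"), standard_tokens("Alise Jonas"))
print(phonetic_tokens("Alice Jones"), phonetic_tokens("Alise Jonas"))
```

The two spellings produce different standard tokens but identical phonetic codes, which is the property the phonetic field relies on.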
27
Step 3. Matchers
Take the attributes. Define their Boolean query logic. Use templates for variables.
Person
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
{{ field }} – The field of an index.
{{ value }} – The value of an attribute.
We will replace these at query time.
Phonetic
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 0
    }
  }
}
Standard
{
  "match": {
    "{{ field }}": {
      "query": "{{ value }}",
      "fuzziness": 2
    }
  }
}
Understand that each matcher will be combined
into one large Boolean query.
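The substitution and combination described above can be sketched in Python as follows. The `render` and `combine` helpers and the field names are hypothetical, and wrapping the clauses in a `bool` query with `filter` is one plausible way to combine them, not necessarily how any particular plugin does it.

```python
import json

# The two matcher templates from the slide, as Python dictionaries.
MATCHERS = {
    "phonetic": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 0}}},
    "standard": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 2}}},
}


def render(template, field, value):
    """Substitute the placeholders in every key and string value."""
    if isinstance(template, dict):
        return {
            render(key, field, value): render(item, field, value)
            for key, item in template.items()
        }
    if isinstance(template, str):
        return template.replace("{{ field }}", field).replace("{{ value }}", value)
    return template  # numbers such as fuzziness pass through unchanged


def combine(clauses):
    """Require every rendered clause to match (assumed combination logic)."""
    return {"bool": {"filter": clauses}}


clauses = [
    render(MATCHERS["phonetic"], "first_name.phonetic", "Alice"),
    render(MATCHERS["standard"], "last_name", "Jones"),
]
query = {"query": combine(clauses)}
print(json.dumps(query, indent=2))
```

Rendering on parsed dictionaries rather than on raw JSON text avoids having to escape quotes inside attribute values.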
29
Step 4. Resolvers
Take the attributes.
Name – First Name
Name – Last Name
Address – Street
Address – City
Address – Province
Address – Postal Code
Address – Country
Date of Birth
Phone Number
Email Address
IP Address
Credit Card Number
Social Security Number
Determine which combinations of matching attributes imply a resolution.
[ Name – First, Name – Last, Address – Street, Address – City, Address – Province ]
[ Name – First, Name – Last, Address – Street, Address – Postal Code ]
[ Name – First, Name – Last, Date of Birth, Address – City, Address – Province ]
[ Name – First, Name – Last, Date of Birth, Address – Postal Code ]
[ Name – First, Name – Last, Phone Number ]
[ Name – First, Name – Last, Email Address ]
[ Name – First, Name – Last, IP Address ]
[ Name – First, Name – Last, Credit Card Number ]
[ Name – First, Name – Last, Social Security Number ]
[ Email Address, Phone Number ]
[ Email Address, IP Address ]
[ Email Address, Credit Card Number ]
[ IP Address, Credit Card Number ]
Person
Avoid resolving on a single attribute such as Social Security Number.
Corroboration among multiple attributes helps prevent snowballs.
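A resolver can be read as a set of attributes that must all match together. The sketch below, with hypothetical short attribute names and only a few of the combinations listed above, checks which resolvers a given set of matched attributes satisfies.

```python
# A few of the resolvers from the slide, using hypothetical short names.
RESOLVERS = [
    {"first_name", "last_name", "phone"},
    {"first_name", "last_name", "email"},
    {"first_name", "last_name", "ssn"},
    {"email", "phone"},
    {"email", "ip"},
]


def applicable_resolvers(matched_attributes):
    """Return the resolvers whose attributes are all among the matches."""
    matched = set(matched_attributes)
    return [resolver for resolver in RESOLVERS if resolver <= matched]


def resolves(matched_attributes):
    """True if at least one resolver is fully satisfied."""
    return bool(applicable_resolvers(matched_attributes))


print(resolves({"ssn"}))                      # a single attribute is not enough
print(resolves({"first_name", "last_name"}))  # a name alone is not enough
print(resolves({"email", "phone"}))           # two corroborating attributes
```

Because no resolver consists of a single attribute, one matching value on its own never links two records.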
31
Step 5. Metadata Maps
Take the attributes. Map them to the fields of the relevant indices.
Person
Attribute                 users index        customers index
Name – First Name         users.first_name   customers.fname
Name – Last Name          users.last_name    customers.lname
Address – Street          -                  -
Address – City            -                  -
Address – Province        -                  -
Address – Postal Code     -                  customers.zip
Address – Country         -                  -
Date of Birth             -                  -
Phone Number              users.phone        customers.tel
Email Address             users.email        customers.email
IP Address                -                  -
Credit Card Number        -                  customers.cc
Social Security Number    -                  -
32
Step 6. Recursive Queries
With each query, new inputs might be found in different attributes.
Use the metadata map and your resolvers to determine if you can
create new queries for the new inputs.
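The loop described above can be sketched in Python against in-memory data. Everything here is a simplification under stated assumptions: the documents, index names, and helper names are hypothetical, matching is exact rather than fuzzy or phonetic, and a real deployment would issue Elasticsearch queries instead of scanning lists.

```python
# Hypothetical documents in two indices with different field names.
INDICES = {
    "users": [
        {"first_name": "Allie", "last_name": "Jones",
         "phone": "202-555-1234", "email": "allie@example.com"},
        {"first_name": "Aly", "last_name": "Jonas",
         "phone": "610-555-5555", "email": "aly@example.com"},
    ],
    "customers": [
        {"fname": "Allison", "lname": "Smith",
         "tel": "202-555-1234", "email": "allie@example.com"},
        {"fname": "Allison", "lname": "Smith",
         "tel": "202-555-9876", "email": "allie@example.com"},
    ],
}

# Metadata map: attribute name -> field name, per index.
METADATA_MAP = {
    "users": {"first_name": "first_name", "last_name": "last_name",
              "phone": "phone", "email": "email"},
    "customers": {"first_name": "fname", "last_name": "lname",
                  "phone": "tel", "email": "email"},
}

RESOLVERS = [
    ("first_name", "last_name", "phone"),
    ("first_name", "last_name", "email"),
    ("email", "phone"),
]


def doc_matches(doc, fields, known):
    """True if every attribute of some resolver matches a known value."""
    return any(
        all(a in fields and doc.get(fields[a]) in known.get(a, set())
            for a in resolver)
        for resolver in RESOLVERS
    )


def resolve(attributes, max_hops=10):
    """Repeat the matching step until a hop finds no new documents."""
    known = {attr: set(values) for attr, values in attributes.items()}
    found = []  # (index name, position) of every resolved document
    for _ in range(max_hops):
        new_hits = [
            (index, pos)
            for index, docs in INDICES.items()
            for pos, doc in enumerate(docs)
            if (index, pos) not in found
            and doc_matches(doc, METADATA_MAP[index], known)
        ]
        if not new_hits:
            break
        for index, pos in new_hits:
            found.append((index, pos))
            # Feed the values of the new hit back in as inputs.
            for attr, field in METADATA_MAP[index].items():
                if field in INDICES[index][pos]:
                    known.setdefault(attr, set()).add(INDICES[index][pos][field])
    return found, known


found, known = resolve({"first_name": ["Allie"], "last_name": ["Jones"],
                        "phone": ["202-555-1234"]})
print(found)
print(sorted(known["phone"]))
```

In this toy data the first hop finds the user record by name and phone, the second hop uses the newly learned email to reach a customer record, and the third hop uses the newly learned name to reach the record with the changed phone number. The unrelated "Aly Jonas" record is never linked.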

Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Voces Mineras
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeBoston Institute of Analytics
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSSnehalVinod
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIf6x4zqzk86
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...LuisMiguelPaz5
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...ThinkInnovation
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token PredictionNABLAS株式会社
 

Recently uploaded (20)

sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
 
Abortion pills in Jeddah |+966572737505 | get cytotec
Abortion pills in Jeddah |+966572737505 | get cytotecAbortion pills in Jeddah |+966572737505 | get cytotec
Abortion pills in Jeddah |+966572737505 | get cytotec
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AI
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 

Real-Time Entity Resolution with Elasticsearch - Haystack 2018

  • 1. Dave Moore david.moore@elastic.co Haystack: The Search Relevancy Conference (11 April 2018) Real-Time Entity Resolution With Elasticsearch
  • 2. Disambiguation. "Named Entity Recognition": single attributes in unstructured text. vs. "Entity Resolution": multiple attributes in structured data. Example entity: Person (Field: Value). Name: Alice Jones. DOB: 1984-01-01. Street: 123 Main St. Credit Card: 4040 0000 2020 8080. Phone: 202-555-1234.
  • 3. What is entity resolution?
  • 4. Examples • Health Care (Patient ID): We need to identify [...] and their medical [...] many hand-written [...] Mixing up records puts [...] at risk of injury or [...] • Sales & Marketing (Customer Intel): We have reps managing many sources of info on leads and customers. Our view of the buyer is fragmented and that makes us less effective. We're losing pipeline. • Security & Compliance (Fraud): We need to track a person or device that is hiding its tracks. Connecting the dots is a laborious process and we can't keep up with our incident backlog. • Military, IC, Law (Surveillance): We need to track a person or device that is hiding its identity. Our timely success is critical to public safety and national security. • Privacy Compliance (GDPR): We must find and manage all PII to respond to inquiries. Failure to comply risks fines of €20 million or 4% annual turnover. • IT (MDM): MDM is a slow and bureaucratic process. We can solve our own data quality problems faster and better. And we still need query time entity resolution.
  • 5. Why is identity hard to track?
  • 6. 1. Identity is Vague • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234 • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234 • Icons by icons8
  • 7. 2. Identity Changes • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234 • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234 • Allison Smith | 555 Broad St | XYZ Technology Corp. | 3030 5050 9999 0000 | 202-555-9876 • Alison Jones-Smith | 555 Brooad Street | XYZ Tech | 3030 5500 9999 0000 | 2025559867
  • 8. 3. Identity is Messy • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234 • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234 • Allison Smith | 555 Broad St | XYZ Technology Corp. | 3030 5050 9999 0000 | 202-555-9876 • Alison Jones-Smith | 555 Brooad Street | XYZ Tech | 3030 5500 9999 0000 | 2025559867
  • 9. 4. Identity is Diverse • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234 • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234 • Allison Smith | 555 Broad St | XYZ Technology Corp. | 3030 5050 9999 0000 | 202-555-9876 • Alison Jones-Smith | 555 Brooad Street | XYZ Tech | 3030 5500 9999 0000 | 2025559867 • ??? ??? ??? ???
  • 10. Entity Resolution connects the dots despite these challenges
  • 11. Comparison to Search. Both columns start from the same query: name:"Allie Jones" AND street:"123 Main St"
      Search: Search engine ranks results once. True hits mixed with noise.
        Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
        Allie Jones | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
        Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
        Ali Jones | 132 Mane Street | ABC Widgets | 4024 0071 4970 1227 | 888-555-5555
        Aly Jonas | 113 Main Street | Acme Corp. | 4716 1035 4536 4671 | 610-555-5555
        Allie Jones | 132 W Main Street | ABC Widgets | 4040 0000 2020 8080 | 202-555-9876
        Al Jones | 132 E Main St | Mom & Pop, LLC | 3772 733741 52501 | 1-610-555-0000
        Aly Jones | 113 Main St, #102 | Acme Corp. | 4716 1035 4536 4671 | 610-555-5555
        Ali Jones | 132 Mane Street | ABC Widgets | 4024 0071 4970 1227 | 888-555-1234
        Aly Jonas | 113 Main Street | Acme Corp. | 4781 9105 0533 4481 | 610-555-2345
        Allie Johns | 132 W Main Street | ABC Widgets | 4088 0110 2044 8180 | 202-555-3456
        Elle Jeon | 132 E Main St | Mom & Pop, LLC | 3502 730741 52203 | 1-610-555-4567
        Elle Jones | 113 Main St, #102 | Acme Corp. | 4716 1035 4536 4671 | 610-555-5678
        Eli Jones | 132 Mane Street | ABC Widgets | 4224 0065 4800 1337 | 888-555-6789
        Eli Joans | 113 Main Street | Acme Corp. | 4206 1035 4536 4081 | 610-555-7890
        Allie Jeans | 132 N Mean Street | ABC Widgets | 4240 0101 02020 8888 | 202-555-8901
      Resolution: Search engine filters results recursively. True hits isolated and transitively linked.
        Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
        Allie Jones | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
        Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
        Allie Jones | 132 W Main Street | ABC Widgets | 4040 0000 2020 8080 | 202 555 1234
        Allie Smith | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
        Allie Smith | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
        Ali Smith | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
        Allie Smith | 555 Broad St | ABC Widgets, Inc | 4040 0000 2020 8080 | 202-555-1234
        Allie Smith | 555 Broad Street | XYZ Tech Corp | 3030 5050 9999 0000 | 202.555.1234
        Allie Smith | 555 Broad Street | XYZ Technology Corp | 3030 5050 9999 0000 | 202-555-9876
  • 13. Batch vs. Real-Time
      Batch. How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components). How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents). When is it necessary? Population or network analysis.
      Real-Time. How is it used? Resolve one entity on query (recursive Boolean query). How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each). When is it necessary? Individual analysis.
      Most solutions have a real-time phase, sometimes applied after batch resolution.
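The two "How long does it take?" expressions on this slide can be read as rough operation counts. The sketch below only evaluates them as written; the function names and the sample inputs are hypothetical, and the results are unitless estimates rather than timings.

```python
def batch_cost(docs, partitions, components):
    """Docs + (Docs/Partitions)^2 + Components^2, as written on the slide."""
    return docs + (docs / partitions) ** 2 + components ** 2


def realtime_cost(indices, attributes, hops):
    """Indices * Attributes * Hops, as written on the slide."""
    return indices * attributes * hops


# Hypothetical inputs, chosen only to show how the two expressions grow.
print(batch_cost(docs=1000, partitions=10, components=20))
print(realtime_cost(indices=3, attributes=5, hops=4))
```

The batch expression grows quadratically with partition size, while the real-time expression stays small as long as each factor is "a handful".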
  • 14. Why Elasticsearch (Real-Time). Robust matching • Token normalization • Phonetic matching • Fuzzy transpositions • Boolean logic filtering • Fine-tune search parameters. Suited for operations • Horizontal scaling • Real-time response rates • Flexible index mappings
  • 15. Approach: Goals • Fast – Get results in real-time. From milliseconds to low seconds. • Generic – Resolve any type of entity. People, companies, locations, sessions, etc. • Transitive – Resolve over multiple hops of matches. Capture changing identities. • Multi-source – Resolve over multiple indices with disparate mappings. • Accommodating – Operate on data as it exists. Avoid transforming and reindexing data. • Logical – Logic is easier to read, troubleshoot, and optimize than statistics. • 100% Elasticsearch – Operate within existing search infrastructure.
  • 16. Approach: Steps 1. Entity modeling – What is the entity? What are its attributes? 2. Analyzers – How are you indexing each attribute? 3. Matchers – What is the query logic for each attribute? 4. Resolvers – What combinations of matching attributes imply a resolution? 5. Metadata maps – Which matchers apply to which indexed fields? 6. Recursive queries – How to repeat the queries until completion?
  • 17. zentity.io: An open source Elasticsearch plugin for real-time entity resolution
  • 18. zentity.io • POST _zentity/resolution/person { "attributes": { "name": "Alice Jones", "dob": "1984-01-01", "phone": [ "555-123-4567", "555-987-6543" ] } }
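As a minimal sketch, the request on this slide could be assembled in Python as shown below. The helper name build_resolution_request is hypothetical. The path and body shape are copied from the slide, and the plugin's API may have changed since the talk. The code only builds the request. Sending it would need an HTTP client and a cluster with the plugin installed.

```python
import json


def build_resolution_request(entity_type, attributes):
    """Return (method, path, body) for a resolution request shaped like the slide's."""
    path = "_zentity/resolution/" + entity_type
    body = json.dumps({"attributes": attributes})
    return "POST", path, body


method, path, body = build_resolution_request(
    "person",
    {
        "name": "Alice Jones",
        "dob": "1984-01-01",
        # An attribute may carry several known values, as the slide's phone list does.
        "phone": ["555-123-4567", "555-987-6543"],
    },
)
print(method, path)
print(body)
```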
  • 20. Demos • Customer intelligence: Gather everything we know about a customer. • Web traffic sessionization: Track a bot that cycles through IP addresses, cookies, and user agent signatures. • Fraud detection: Determine if a health care provider was blacklisted under a different name.
  • 24. Step 1. Entity Modeling. Name the entity type (Person). Define its attributes. Study them in your data sets.
      Attribute                 Uniqueness  Consistency  Presence
      Name – First Name         Moderate    Moderate     Extreme
      Name – Last Name          Moderate    Moderate     High
      Address – Street          High        Low          Moderate
      Address – City            Low         Moderate     Moderate
      Address – Province        Low         High         High
      Address – Postal Code     Low         High         High
      Address – Country         Low         High         High
      Date of Birth             Moderate    High         Moderate
      Phone Number              Moderate    Moderate     Moderate
      Email Address             High        Extreme      Moderate
      IP Address                High        Extreme      Low
      Credit Card Number        Extreme     Extreme      Low
      Social Security Number    Extreme     High         None
      Icons by icons8
  • 25. Step 1. Entity Modeling. This model is independent from your indices. You can reuse and extend this model as you add or amend indices.
  • 26. Step 2. Analyzers. Take the attributes (Person). Define their analyzers. Put them in your index mappings.
      Standard: "Alice Jones" => ["ALICE","JONES"]
      Phonetic: "Alice Jones" => ["ALAC","JAN"]
      Index settings: { "settings": { "index": { "analysis": { "filter": { "phonetic": { "type": "phonetic", "encoder": "nysiis" } }, "analyzer": { "phonetic": { "filter": [ "icu_normalizer", "icu_folding", "phonetic" ], "tokenizer": "standard" } } } } } }
      Index mappings: { "mappings": { "_doc": { "properties": { "first_name": { "type": "text", "fields": { "phonetic": { "type": "text", "analyzer": "phonetic" } } } } } } }
  • 27. Step 2. Analyzers. Analyzers are powerful. But they must be defined prior to indexing. Give careful thought to your analyzers to avoid having to reindex data.
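The settings and mappings on this slide can be combined into a single index-creation body. The sketch below does that in Python. The helper names and the extra last_name field are assumptions added for illustration. The "_doc" level follows the slide, which dates from the Elasticsearch 6.x era. Later versions dropped mapping types, so "properties" would sit directly under "mappings". The "phonetic" filter comes from the analysis-phonetic plugin, and the two ICU filters come from the analysis-icu plugin, so both must be installed.

```python
import json

# Custom "phonetic" token filter and analyzer, copied from the slide.
SETTINGS = {
    "index": {
        "analysis": {
            "filter": {"phonetic": {"type": "phonetic", "encoder": "nysiis"}},
            "analyzer": {
                "phonetic": {
                    "tokenizer": "standard",
                    "filter": ["icu_normalizer", "icu_folding", "phonetic"],
                }
            },
        }
    }
}


def text_with_phonetic_subfield():
    """A text field with an extra .phonetic sub-field, as on the slide."""
    return {
        "type": "text",
        "fields": {"phonetic": {"type": "text", "analyzer": "phonetic"}},
    }


def create_index_body(field_names, doc_type="_doc"):
    """Hypothetical helper: one body holding both settings and mappings."""
    properties = {name: text_with_phonetic_subfield() for name in field_names}
    return {"settings": SETTINGS, "mappings": {doc_type: {"properties": properties}}}


# "last_name" is an assumed second field; the slide only shows "first_name".
body = create_index_body(["first_name", "last_name"])
print(json.dumps(body, indent=2))
```

Each attribute is then indexed twice: the parent field uses the default standard analyzer, and the .phonetic sub-field holds the phonetic tokens.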
  • 28. Step 3. Matchers. Take the attributes (Person). Define their Boolean query logic. Use templates for variables. {{ field }} – The field of an index. {{ value }} – The value of an attribute. We will replace these at query time.
      Phonetic: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 0 } } }
      Standard: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 2 } } }
  • 29. Step 3. Matchers. Understand that each matcher will be combined into one large Boolean query.
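A minimal sketch of replacing {{ field }} and {{ value }} at query time, assuming simple string substitution. The render_matcher helper is hypothetical and is not taken from the talk or the plugin. Values are JSON-escaped before substitution so that quotes in a name cannot break the rendered query.

```python
import json

# Matcher templates copied from the slide.
MATCHERS = {
    "phonetic": '{ "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 0 } } }',
    "standard": '{ "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 2 } } }',
}


def json_escape(text):
    """Escape text for embedding inside a JSON string literal."""
    return json.dumps(text)[1:-1]


def render_matcher(matcher, field, value):
    """Fill a matcher template and parse the result into a query clause."""
    rendered = (
        MATCHERS[matcher]
        .replace("{{ field }}", json_escape(field))
        .replace("{{ value }}", json_escape(value))
    )
    return json.loads(rendered)


print(render_matcher("standard", "first_name", "Alice"))
print(render_matcher("phonetic", "first_name.phonetic", 'Al "Allie" Jones'))
```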
  • 30. Step 4. Resolvers. Take the attributes (Person). Determine which combinations of matching attributes imply a resolution.
      [ Name – First, Name – Last, Address – Street, Address – City, Address – State ]
      [ Name – First, Name – Last, Address – Street, Address – Postal Code ]
      [ Name – First, Name – Last, Date of Birth, Address – City, Address – State ]
      [ Name – First, Name – Last, Date of Birth, Address – Postal Code ]
      [ Name – First, Name – Last, Phone Number ]
      [ Name – First, Name – Last, Email Address ]
      [ Name – First, Name – Last, IP Address ]
      [ Name – First, Name – Last, Credit Card Number ]
      [ Name – First, Name – Last, Social Security Number ]
      [ Email Address, Phone Number ]
      [ Email Address, IP Address ]
      [ Email Address, Credit Card Number ]
      [ IP Address, Credit Card Number ]
  • 31. Step 4. Resolvers. Avoid resolving on a single attribute such as Social Security Number. Corroboration among multiple attributes helps prevent snowballs.
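One way to read the resolver list: a document resolves when the set of attributes that matched contains at least one resolver as a subset. The sketch below encodes the thirteen resolvers from the slide under assumed short attribute names. It also shows one possible way to fold matcher clauses into a single Boolean query. That query shape is an assumption, not the plugin's actual output.

```python
# The slide's resolvers, written with assumed short attribute names.
RESOLVERS = [
    {"first_name", "last_name", "street", "city", "state"},
    {"first_name", "last_name", "street", "postal_code"},
    {"first_name", "last_name", "dob", "city", "state"},
    {"first_name", "last_name", "dob", "postal_code"},
    {"first_name", "last_name", "phone"},
    {"first_name", "last_name", "email"},
    {"first_name", "last_name", "ip"},
    {"first_name", "last_name", "credit_card"},
    {"first_name", "last_name", "ssn"},
    {"email", "phone"},
    {"email", "ip"},
    {"email", "credit_card"},
    {"ip", "credit_card"},
]


def resolves(matched_attributes):
    """True if the matched attributes satisfy at least one resolver."""
    matched = set(matched_attributes)
    return any(resolver <= matched for resolver in RESOLVERS)


def resolver_query(clauses):
    """Combine per-attribute clauses into one bool query (assumed shape).

    clauses maps an attribute name to an already rendered matcher clause.
    Only resolvers whose attributes all have a clause are included.
    """
    available = set(clauses)
    groups = [
        {"bool": {"filter": [clauses[attr] for attr in sorted(resolver)]}}
        for resolver in RESOLVERS
        if resolver <= available
    ]
    return {"bool": {"should": groups, "minimum_should_match": 1}}


print(resolves({"email", "phone"}))
print(resolves({"ssn"}))
```

No resolver contains a single attribute, which reflects the slide's advice to require corroboration.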
  • 32. Step 5. Metadata Maps. Take the attributes (Person). Map them to the fields of the relevant indices: users.first_name, users.last_name, users.phone, users.email, customers:fname, customers:lname, customers:tel, customers:email, customers:cc, customers:zip
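A metadata map can be sketched as a per-index dictionary from attribute name to indexed field name. The index and field names below come from the slide. The attribute names and the pairing of each field with an attribute (for example, "cc" with credit card number) are assumptions.

```python
# index -> attribute -> indexed field. The pairings are assumed from the field names.
METADATA_MAP = {
    "users": {
        "first_name": "first_name",
        "last_name": "last_name",
        "phone": "phone",
        "email": "email",
    },
    "customers": {
        "first_name": "fname",
        "last_name": "lname",
        "phone": "tel",
        "email": "email",
        "credit_card": "cc",
        "postal_code": "zip",
    },
}


def fields_for(index, attributes):
    """Translate attribute values into the field names of one index.

    Attributes that the index does not store are skipped.
    """
    mapping = METADATA_MAP[index]
    return {mapping[attr]: value for attr, value in attributes.items() if attr in mapping}


known = {"first_name": "Allie", "credit_card": "4040 0000 2020 8080"}
print(fields_for("users", known))
print(fields_for("customers", known))
```

Because the entity model is kept separate from the indices, adding an index only means adding one more entry to the map.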
  • 33. Step 6. Recursive Queries. With each query, new inputs might be found in different attributes. Use the metadata map and your resolvers to determine if you can create new queries for the new inputs.
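The recursion can be illustrated without a cluster. In the sketch below, an in-memory list of documents and exact string equality stand in for Elasticsearch indices and matchers. The three resolvers are simplified and hypothetical, and the records are loosely based on the Allie Jones examples from earlier slides. Each hop adds the attribute values of newly resolved documents to the known inputs and tries again until nothing new is found.

```python
# Simplified, hypothetical resolvers: any two of name, card, phone.
RESOLVERS = [{"name", "phone"}, {"name", "card"}, {"card", "phone"}]

DOCS = [
    {"id": 1, "name": "Allie Jones", "card": "4040 0000 2020 8080", "phone": "202-555-1234"},
    {"id": 2, "name": "Allie Smith", "card": "4040 0000 2020 8080", "phone": "202-555-1234"},
    {"id": 3, "name": "Allie Smith", "card": "3030 5050 9999 0000", "phone": "202-555-1234"},
    {"id": 4, "name": "Allie Smith", "card": "3030 5050 9999 0000", "phone": "202-555-9876"},
    {"id": 5, "name": "Aly Jonas", "card": "4716 1035 4536 4671", "phone": "610-555-5555"},
]


def resolve(seed, docs, resolvers, max_hops=10):
    """Return {doc id: hop at which the doc was resolved}."""
    known = {attr: {value} for attr, value in seed.items()}
    found = {}
    for hop in range(max_hops):
        new_docs = []
        for doc in docs:
            if doc["id"] in found:
                continue
            # Attributes of this doc that equal any value known so far.
            matched = {attr for attr, values in known.items() if doc.get(attr) in values}
            if any(resolver <= matched for resolver in resolvers):
                new_docs.append(doc)
        if not new_docs:
            break  # No new inputs, so the recursion is complete.
        for doc in new_docs:
            found[doc["id"]] = hop
            for attr, value in doc.items():
                if attr != "id":
                    known.setdefault(attr, set()).add(value)
    return found


result = resolve({"name": "Allie Jones", "phone": "202-555-1234"}, DOCS, RESOLVERS)
print(result)
```

Document 4 shares no value with the original inputs and is reached only through values picked up on earlier hops. Document 5 never satisfies a resolver and stays out of the result.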