Entity Resolution Using Patient Records at CMMI

Brett Luskin
Data Scientist at NewWave

Agenda
§ Introductions
§ Entity Resolution at CMMI
§ Question & Answer

Introductions:
Brett Luskin
§ Data Scientist at NewWave
§ Specializing in Entity Resolution
§ Has a strange-looking dog
NewWave
§ Full-service health IT firm
§ 15+ years serving CMS
§ Led entity resolution innovation at CMMI
§ Problem-solving with NLP and ML

Introductions:
Luke Bilbro
§ Data Scientist at Databricks
§ Entity Resolution Lead Architect
§ Virtuoso bongo player
Qing Sun
§ Data Scientist at Databricks
§ Databricks Solutions Expert
§ Thought about climbing Everest

Entity Resolution at the CMS Innovation Center

§ Data drives everything.
§ NewWave was tasked with building out one of the most important health data ecosystems in the country (CMMI).
§ NewWave led deployment of Databricks-enabled Entity Resolution at the nation’s largest healthcare payer (CMS).
Key Data NewWave Manages:
§ (CMS) Integrated Data Repository (IDR)
§ (CCW) Chronic Condition Warehouse
Government + Innovation Center
No one else has access to the quality + quantity of health data that NewWave does.
CMS is the largest single payer for healthcare services in the US.
NewWave is at the center of a healthcare revolution at CMMI.

Government + Innovation Center
§ A “Model” is a constellation of people, processes, policy and technology: an experiment in how to transform healthcare
§ The use cases defined by CMMI models are built with best-in-class technology with the capacity to scale nationwide.

What is Entity Resolution (ER)?
§ Have we seen this person/business/item before?
§ How many distinct patients are there?
§ Are these similar sounding things the same thing?
§ What is the “Golden Record” for this entity?
§ Entity Resolution is the task of disambiguating records that correspond to real world entities across and within datasets.
§ Deduplication
§ Record linking
§ Canonicalization

What is Entity Resolution (ER)?
From a Data Science perspective
§ Complexity reduction – clustering
§ Matching records – classification
§ Resolving matches to entities – graph computing

The Scaling Challenge
name
Danielle Dickson
Jennifer Morales
Nicole Fernandez
Dickson, D.
id1  name1             id2  name2
1    Danielle Dickson  2    Jennifer Morales
1    Danielle Dickson  3    Nicole Fernandez
1    Danielle Dickson  4    Dickson, D.
2    Jennifer Morales  3    Nicole Fernandez
2    Jennifer Morales  4    Dickson, D.
3    Nicole Fernandez  4    Dickson, D.

The Scaling Challenge
§ 4 records = 6 possible pairs
§ 4 million records = 8 trillion possible pairs
§ At 1 million pairs per second, comparing them all would take about 3 months
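The figures on this slide follow from the number of unordered pairs among n records, n(n-1)/2. A minimal sketch of that arithmetic, where the function name and the 1-million-pairs-per-second rate constant are illustrative choices taken from the bullet above:

```python
def possible_pairs(n_records: int) -> int:
    """Number of unordered record pairs: n choose 2."""
    return n_records * (n_records - 1) // 2

PAIRS_PER_SECOND = 1_000_000  # rate quoted on the slide

small = possible_pairs(4)            # 6
large = possible_pairs(4_000_000)    # 7,999,998,000,000 (about 8 trillion)

seconds = large / PAIRS_PER_SECOND
days = seconds / 86_400              # about 92.6 days, roughly 3 months

print(small, large, round(days, 1))
```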

Blocking
§ Reducing the number of comparisons made by rule or algorithm
§ Dependent on data type
[Figure: three sets of record pairs: the set of all pairs of records; the “Candidate Pairs” (set of records meeting the blocking rule/algo); and the matching pairs of records]
Reference: Getoor, Machanavajjhala
https://users.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

MinHash LSH
§ Shingling → Splitting/Tokenizing text
§ Min Hashing
§ Locality Sensitive Hashing (LSH)
§ Most Hashing algorithms: small changes in input → large change in output
§ LSH: small changes in input → small changes in output

Shingling
§ Danielle Dickson
§ ‘Dan’, ‘ani’, ‘nie’, ‘iel’, … ‘son’
§ Dani Dickson
§ ‘Dan’, ‘ani’, ‘ni_’, ‘i_D’, … ‘son’
Jaccard similarity:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (intersection over union of the two shingle sets $A$ and $B$)

Shingling
Jaccard similarity: $\frac{8}{16}$
Token  Value_1  Value_2
0      Dan      Dan
1      ani      ani
2      nie      null
3      iel      null
4      ell      null
5      lle      null
6      le_      null
7      e_D      null
8      _Di      _Di
9      Dic      Dic
10     ick      ick
11     cks      cks
12     kso      kso
13     son      son
14     null     ni_
15     null     i_D
• What is the distribution of scores when
using different parameters?
• What does the ideal distribution look like?
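The shingling example above can be reproduced in a few lines. This is a minimal sketch, assuming character 3-grams with spaces shown as underscores (as in the table); the function names and the `k` parameter are illustrative, and `k` is one of the parameters whose effect on the score distribution the questions above ask about.

```python
def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping character k-grams, showing spaces as '_'."""
    text = text.replace(" ", "_")
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

value_1 = shingles("Danielle Dickson")   # 14 distinct shingles
value_2 = shingles("Dani Dickson")       # 10 distinct shingles

# 8 shared shingles out of 16 total
print(len(value_1 & value_2), len(value_1 | value_2), jaccard(value_1, value_2))
# prints: 8 16 0.5
```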

MinHash
§ Short integer signature for each text
§ Signature is built from permutations of the shingle index
Token  Value_1  Value_2
0      Dan      Dan
1      ani      ani
2      nie      null
Index  Value_1  Value_2
0      1        1
1      1        1
2      1        0
Index  Value_1  Value_2
2      1        0
0      1        1
1      1        1
Value_1 Signature: {0, 2}
Signature Matrix:
{0, 0
2, 0}
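One reading of the tables above: for each permutation of the shingle index, the signature records the first index (in permuted order) at which the text has a 1. The sketch below follows that reading with explicit permutations; production MinHash implementations typically replace explicit permutations with hash functions, and the function names here are illustrative.

```python
def minhash_signature(present: set, permutations: list) -> list:
    """For each permutation, take the first shingle index that the text contains."""
    return [next(idx for idx in perm if idx in present) for perm in permutations]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree / len(sig_a)

# Shingle indexes present in each text (first three rows of the token table)
value_1 = {0, 1, 2}   # Dan, ani, nie
value_2 = {0, 1}      # Dan, ani

# The two row orderings shown on the slide
permutations = [[0, 1, 2], [2, 0, 1]]

sig_1 = minhash_signature(value_1, permutations)   # [0, 2]
sig_2 = minhash_signature(value_2, permutations)   # [0, 0]
print(sig_1, sig_2, estimated_jaccard(sig_1, sig_2))
```

With only two permutations the estimate (0.5) is coarse compared with the true Jaccard similarity of these three rows (2/3); more permutations tighten the estimate.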

MinHash LSH
§ Matrix of signatures is divided into bands
§ Bands are hashed and results are put into “buckets” which have candidate pairs
Signature Matrix:
{0, 0
2, 0}
Records with identical bands land in the same bucket
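The banding step can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark implementation used in the talk: the signatures, record IDs, and band count below are hypothetical, and a tuple used as a dictionary key stands in for hashing each band.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures: dict, bands: int) -> set:
    """Split each signature into bands; records sharing any identical band become a candidate pair."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands   # rows per band; any leftover rows are ignored
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for record_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])   # the dict hashes this key
            buckets[band].append(record_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates

# Hypothetical 4-value signatures for four records
signatures = {
    "a": [0, 2, 5, 1],
    "b": [0, 2, 7, 3],
    "c": [4, 6, 7, 3],
    "d": [9, 9, 9, 9],
}

pairs = lsh_candidate_pairs(signatures, bands=2)
print(sorted(pairs))   # a-b share the first band, b-c share the second
```

More bands with fewer rows each make it easier for two records to collide, which yields more candidate pairs; fewer, wider bands make blocking stricter.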

Scoring Pairs
§ MinHash LSH is done feature-wise for each record
§ Scored based on data type
§ SSN/Driver’s License
§ Name
§ Address
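Scoring by data type might look like the sketch below. The slide does not say which similarity function is used for each field, so the choices here are assumptions: an exact match for identifiers and a `difflib` ratio for free text. The field names, the sample records, and the rule of scoring missing values as 0.0 are also hypothetical.

```python
from difflib import SequenceMatcher

def exact_score(a, b) -> float:
    """Identifier-style fields: 1.0 only when both values exist and are equal."""
    if a is None or b is None:
        return 0.0
    return 1.0 if a == b else 0.0

def text_score(a, b) -> float:
    """Free-text fields: case-insensitive similarity ratio between 0 and 1."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Assumed mapping from field to scorer (one scorer per data type)
SCORERS = {"ssn": exact_score, "name": text_score, "address": text_score}

def score_pair(left: dict, right: dict) -> dict:
    """Return one similarity score per feature for a candidate pair."""
    return {field: scorer(left.get(field), right.get(field))
            for field, scorer in SCORERS.items()}

# Synthetic records, not real data
left = {"ssn": "000-00-0001", "name": "Danielle Dickson", "address": None}
right = {"ssn": "000-00-0001", "name": "Dani Dickson", "address": "229 Schroeder Bridge"}

scores = score_pair(left, right)
print(scores)
```

The resulting dictionary is one row of feature-wise similarity scores, the kind of input a match/no-match classifier would consume.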

Classification
§ Class imbalance
§ Gradient Boosted Trees
§ Featurewise similarity scores as input
§ Match/Not Match as output

Resolving Entities
§ GraphFrames connected components
§ Two inputs:
§ DataFrame of nodes – unique patient IDs
§ DataFrame of edges – two ID columns and match probability (0 to 1)
§ Transitive linking can create difficult decisions
§ A – B : 0.9
§ B – C : 0.95
§ A – C : 0.3
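The transitive-linking problem can be seen with a small stand-in for connected components. This sketch uses a plain-Python union-find rather than the GraphFrames API, and the 0.5 match threshold and the extra node "D" are assumptions made for illustration.

```python
def connected_components(nodes, edges, threshold=0.5):
    """Group nodes linked by edges whose match probability meets the threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in edges:
        if prob >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))

nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 0.9), ("B", "C", 0.95), ("A", "C", 0.3)]

entities = connected_components(nodes, edges)
print(entities)   # A, B and C form one entity; D stands alone
```

A and C end up in the same entity through B even though their direct match probability is only 0.3, which is the difficult decision the slide refers to.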

Data Challenges in Healthcare
§ PHI/PII requires synthetic data for testing
§ PCRs vs claims data
§ Records can be updated
§ Data quality affects blocking

Synthetic Data
first_name  last_name  address                 city             state        zip_code
Danielle    Dickson    229 Schroeder Bridge    Greenburgh       null         93539
Danielle    Dickson    null                    Greenburgh       null         93539
Danielle    Dickson    null                    Greenburgh       Texas        95339
null        Dickson    null                    Greenburgh       Texas        93539
Danielle    Dickson    229 Schroeder Bridge    Greenburgh       Texas        null
Danielle    Dickson    null                    rGeenburgh       Texas        93539
Danielle    Dickson    229 Schroeder Bridge    null             Texas        93539
Jennifer    Morales    808 Nancy Burg          Jenniferport     Mississippi  null
Jennifer    null       null                    Jenniferport     Mississippi  47037
Morales     Jennifer   808 Nancy Burg          Jenniferport     null         47037
null        Jennifer   808 Nancy Burg          Jenniferport     Mississippi  47037
Morales     null       808 Nancy Burg          Jenniferport     Mississippi  47037
Fernandez   Nicole     221 Day Port # 262      North Keithfurt  null         null
null        Fernandez  221 Day Port # 262      null             Vermont      73334
Nicole      Fernandez  221 Day Port uStie 262  North Keithfurt  null         73334
Nicole      Fernandez  221 Day Port Suite 262  North Keithfurt  Vermont      73334
Nicole      Fernandez  221 Day Port Suite 262  North Keithfurt  Vermont      null
Fernandez   Nicole     221 Day Port # 262      oNrth Ketihfurt  null         73334
Corruption types highlighted: Nulls, Typos, Inversions, Replacements

Synthetic Data
§ Naïve Approach: compare all records
§ 300,000 records → 45 billion candidate pairs
§ Blocking Approach:
§ 4.8 million candidate pairs

Production Data
§ Naïve Approach: compare all records
§ 459,460 records → 105 billion candidate pairs
§ Blocking Approach:
§ 4.1 million candidate pairs – (is this good?)
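One common way to put the blocking figures in context is the reduction ratio: the share of naive comparisons that blocking avoids. The sketch below applies it to the counts quoted on these slides; the function names are illustrative, and the candidate-pair counts are the rounded values from the bullets.

```python
def naive_pairs(n_records: int) -> int:
    """All unordered pairs among n records."""
    return n_records * (n_records - 1) // 2

def reduction_ratio(candidate_pairs: int, n_records: int) -> float:
    """Fraction of the naive comparisons that blocking removes."""
    return 1 - candidate_pairs / naive_pairs(n_records)

synthetic = reduction_ratio(4_800_000, 300_000)
production = reduction_ratio(4_100_000, 459_460)

print(naive_pairs(300_000))    # 44,999,850,000 (about 45 billion)
print(naive_pairs(459_460))    # 105,551,516,070 (about 105 billion)
print(f"{synthetic:.6f} {production:.6f}")
```

Both ratios are above 0.9998, but the reduction ratio alone does not answer "is this good?": it says nothing about how many true matches survive blocking, which requires labeled pairs to measure.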
