Entity Resolution
Using Patient Records
at CMMI
Brett Luskin
Data Scientist at NewWave
Agenda
§ Introductions
§ Entity Resolution at CMMI
§ Question & Answer
Introductions:
Brett Luskin
§ Data Scientist at NewWave
§ Specializing in Entity Resolution
§ Has a strange-looking dog
NewWave
§ Full-service health IT firm
§ 15+ years serving CMS
§ Led entity resolution innovation at CMMI
§ Problem-solving with NLP and ML
Introductions:
Luke Bilbro
§ Data Scientist at Databricks
§ Entity Resolution Lead Architect
§ Virtuoso bongo player
Qing Sun
§ Data Scientist at Databricks
§ Databricks Solutions Expert
§ Thought about climbing Everest
Entity Resolution at the
CMS Innovation Center
§ Data drives everything.
§ NewWave was tasked with building out one of the most important health data ecosystems in the country (CMMI).
§ NewWave led deployment of Databricks-enabled Entity Resolution at the nation’s largest healthcare payer (CMS).
Key Data NewWave Manages:
§ (CMS) Integrated Data Repository (IDR)
§ (CCW) Chronic Condition Warehouse
Government + Innovation Center
No one else has access to the quality + quantity of health data that NewWave does.
CMS is the largest single payer for healthcare services in the US.
NewWave is at the center of a healthcare revolution at CMMI.
Government + Innovation Center
§ A “Model” is a constellation of people, processes, policy and technology: an experiment in how to transform healthcare
§ The use cases defined by CMMI models are built with best-in-class technology with the capacity to scale nationwide.
Entity Resolution
What is Entity Resolution (ER)?
§ Have we seen this person/business/item before?
§ How many distinct patients are there?
§ Are these similar-sounding things the same thing?
§ What is the “Golden Record” for this entity?
§ Entity Resolution is the task of disambiguating records that correspond to real-world entities across and within datasets.
§ Deduplication
§ Record linking
§ Canonicalization
What is Entity Resolution (ER)?
From a Data Science perspective
§ Complexity reduction – clustering
§ Matching records – classification
§ Resolving matches to entities – graph computing
The Scaling Challenge
name
Danielle Dickson
Jennifer Morales
Nicole Fernandez
Dickson, D.

id1 | name1 | id2 | name2
1 | Danielle Dickson | 2 | Jennifer Morales
1 | Danielle Dickson | 3 | Nicole Fernandez
1 | Danielle Dickson | 4 | Dickson, D.
2 | Jennifer Morales | 3 | Nicole Fernandez
2 | Jennifer Morales | 4 | Dickson, D.
3 | Nicole Fernandez | 4 | Dickson, D.
The Scaling Challenge
§ 4 records = 6 possible pairs
§ 4 million records = about 8 trillion possible pairs
§ At 1 million pairs per second, comparing them all would take about 3 months
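The figures above follow from counting unordered pairs, n(n-1)/2. A minimal sketch of that arithmetic; the function names are illustrative and not from the talk:

```python
from math import comb


def possible_pairs(n_records: int) -> int:
    """Number of unordered record pairs: n choose 2 = n * (n - 1) / 2."""
    return comb(n_records, 2)


def days_to_compare(n_records: int, pairs_per_second: float) -> float:
    """Days needed to score every pair at a fixed throughput."""
    seconds = possible_pairs(n_records) / pairs_per_second
    return seconds / 86_400  # seconds per day


print(possible_pairs(4))                       # 6
print(possible_pairs(4_000_000))               # 7999998000000, about 8 trillion
print(days_to_compare(4_000_000, 1_000_000))   # roughly 92.6 days, about 3 months
```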
Blocking
§ Reducing the number of comparisons
made by rule or algorithm
§ Dependent on data type
[Figure labels: Set of all pairs of records; “Candidate Pairs” (set of records meeting blocking rule/algo); Matching Pairs of Records]
Reference: Getoor, Machanavajjhala
https://users.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
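As a rough illustration of rule-based blocking, the sketch below groups records by a blocking key and only pairs records within the same group. The key used here (first letter of the surname) and all function names are hypothetical choices for the example, not the rule used at CMMI:

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Pair up only those records that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = blocking_key(rec)
        if key is not None:  # records without a key are left unblocked
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def first_letter_of_surname(name):
    """Hypothetical rule: handles 'First Last' and 'Last, F.' layouts."""
    surname = name.split(",")[0] if "," in name else name.split()[-1]
    return surname[:1].upper() or None


records = {
    1: "Danielle Dickson",
    2: "Jennifer Morales",
    3: "Nicole Fernandez",
    4: "Dickson, D.",
}

pairs = candidate_pairs(records, first_letter_of_surname)
print(pairs)  # {(1, 4)}: one candidate pair instead of all six
```

A rule this coarse would produce very large blocks on real data, which is one reason an algorithmic approach can be preferable.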
MinHash LSH
§ Shingling → Splitting/Tokenizing text
§ Min Hashing
§ Locality Sensitive Hashing (LSH)
§ Most hashing algorithms: small changes in input → large changes in output
§ LSH: small changes in input → small changes in output
Shingling
§ Danielle Dickson
§ ‘Dan’, ‘ani’, ‘nie’, ‘iel’, … ‘son’
§ Dani Dickson
§ ‘Dan’, ‘ani’, ‘ni_’, ‘i_D’, … ‘son’
Jaccard similarity: intersection / union
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Shingling
Jaccard similarity: 8/16 = 0.5
Token Value_1 Value_2
0 Dan Dan
1 ani ani
2 nie null
3 iel null
4 ell null
5 lle null
6 le_ null
7 e_D null
8 _Di _Di
9 Dic Dic
10 ick ick
11 cks cks
12 kso kso
13 son son
14 null ni_
15 null i_D
• What is the distribution of scores when
using different parameters?
• What does the ideal distribution look like?
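The shingle table can be reproduced with a few lines of code. This is a minimal sketch assuming 3-character shingles with spaces shown as underscores, as in the table; the shingle length `k` is one of the parameters that changes the score distribution:

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-character shingles (spaces become '_')."""
    text = text.replace(" ", "_")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


value_1 = shingles("Danielle Dickson")
value_2 = shingles("Dani Dickson")

print(len(value_1 & value_2))     # 8 shared shingles
print(len(value_1 | value_2))     # 16 distinct shingles
print(jaccard(value_1, value_2))  # 0.5
```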
MinHash
§ Short integer signature for each text
§ Signature is built from permutations of the shingle index: for each permutation, keep the index of the first shingle present in the text
Token | Value_1 | Value_2
0 | Dan | Dan
1 | ani | ani
2 | nie | null

Index | Value_1 | Value_2
0 | 1 | 1
1 | 1 | 1
2 | 1 | 0

Index (permuted) | Value_1 | Value_2
2 | 1 | 0
0 | 1 | 1
1 | 1 | 1

Value_1 Signature: {0,2}
Value_2 Signature: {0,0}
Signature Matrix:
{0, 0
2, 0}
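The signatures on this slide can be reproduced with a small sketch. It assumes each text is represented by the set of shingle indices it contains and that the two permutations are the row orders shown in the index tables; explicit permutations are used only to mirror the slide, since practical implementations typically simulate them with hash functions:

```python
def minhash_signature(shingle_ids, permutations):
    """For each permutation, record the first shingle index present in the text.

    Assumes shingle_ids is non-empty and every permutation covers all indices.
    """
    signature = []
    for perm in permutations:
        first_present = next(idx for idx in perm if idx in shingle_ids)
        signature.append(first_present)
    return signature


# Row orders from the slide: the original order, then the permuted one.
permutations = [[0, 1, 2], [2, 0, 1]]

value_1 = {0, 1, 2}  # contains 'Dan', 'ani', 'nie'
value_2 = {0, 1}     # contains 'Dan', 'ani' only

sig_1 = minhash_signature(value_1, permutations)
sig_2 = minhash_signature(value_2, permutations)
print(sig_1)  # [0, 2]
print(sig_2)  # [0, 0]
```

The fraction of positions where two signatures agree estimates the Jaccard similarity; with only two permutations the estimate is very coarse.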
MinHash LSH
§ Matrix of signatures is divided into bands
§ Each band is hashed into a “bucket”; records that share a bucket become candidate pairs
Value_1 Signature: {0,2}
Value_2 Signature: {0,0}
Signature Matrix:
{0, 0
2, 0}
Records with identical bands fall into the same bucket
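A minimal sketch of the banding step, using the two signatures from the slide. The function name and the choice of band counts are illustrative assumptions; a Python dictionary stands in for the hash buckets:

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands):
    """Split each signature into bands; records sharing any band are candidates."""
    sig_len = len(next(iter(signatures.values())))
    if sig_len % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = sig_len // bands

    buckets = defaultdict(set)
    for rec_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            # Key on (band position, band contents) so bands only match in place.
            buckets[(b, band)].add(rec_id)

    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


signatures = {"Value_1": [0, 2], "Value_2": [0, 0]}

two_bands = lsh_candidate_pairs(signatures, bands=2)
one_band = lsh_candidate_pairs(signatures, bands=1)
print(two_bands)  # {('Value_1', 'Value_2')}: the first band (0,) matches
print(one_band)   # set(): the full signatures differ
```

More bands with fewer rows each make the blocking more permissive; fewer, wider bands make it stricter.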
Scoring Pairs
§ MinHash LSH is done feature-wise for each record
§ Scored based on data type
§ SSN/Driver’s License
§ Name
§ Address
Classification
§ Class imbalance
§ Gradient Boosted Trees
§ Featurewise similarity scores as input
§ Match/Not Match as output
Resolving Entities
§ GraphFrames connected components
§ Two inputs:
§ DataFrame of nodes – unique patient IDs
§ DataFrame of edges – two ID columns and match probability (0–1)
§ Transitive linking can create difficult decisions
§ A – B : 0.9
§ B – C : 0.95
§ A – C : 0.3
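The transitive-linking problem can be seen in a small sketch. This uses a plain-Python union-find in place of GraphFrames so that it runs without Spark, and the probability threshold for keeping an edge is an assumption made for the example:

```python
def connected_components(nodes, edges, threshold):
    """Group nodes linked by edges whose match probability meets the threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in edges:
        if prob >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


nodes = ["A", "B", "C"]
edges = [("A", "B", 0.9), ("B", "C", 0.95), ("A", "C", 0.3)]

loose = connected_components(nodes, edges, threshold=0.5)
strict = connected_components(nodes, edges, threshold=0.92)
print(loose)   # [['A', 'B', 'C']]
print(strict)  # [['A'], ['B', 'C']]
```

At the looser threshold A and C end up in the same entity through B, even though their direct match probability is only 0.3.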
Data Challenges in Healthcare
§ PHI/PII requires synthetic data for testing
§ PCRs vs claims data
§ Records can be updated
§ Data quality affects blocking
Synthetic Data
first_name | last_name | address | city | state | zip_code
Danielle | Dickson | 229 Schroeder Bridge | Greenburgh | null | 93539
Danielle | Dickson | null | Greenburgh | null | 93539
Danielle | Dickson | null | Greenburgh | Texas | 95339
null | Dickson | null | Greenburgh | Texas | 93539
Danielle | Dickson | 229 Schroeder Bridge | Greenburgh | Texas | null
Danielle | Dickson | null | rGeenburgh | Texas | 93539
Danielle | Dickson | 229 Schroeder Bridge | null | Texas | 93539
Jennifer | Morales | 808 Nancy Burg | Jenniferport | Mississippi | null
Jennifer | null | null | Jenniferport | Mississippi | 47037
Morales | Jennifer | 808 Nancy Burg | Jenniferport | null | 47037
null | Jennifer | 808 Nancy Burg | Jenniferport | Mississippi | 47037
Morales | null | 808 Nancy Burg | Jenniferport | Mississippi | 47037
Fernandez | Nicole | 221 Day Port # 262 | North Keithfurt | null | null
null | Fernandez | 221 Day Port # 262 | null | Vermont | 73334
Nicole | Fernandez | 221 Day Port uStie 262 | North Keithfurt | null | 73334
Nicole | Fernandez | 221 Day Port Suite 262 | North Keithfurt | Vermont | 73334
Nicole | Fernandez | 221 Day Port Suite 262 | North Keithfurt | Vermont | null
Fernandez | Nicole | 221 Day Port # 262 | oNrth Ketihfurt | null | 73334

Error types shown: Nulls, Typos, Inversions, Replacements
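The four error types could be injected into clean records along these lines. This is only a sketch: the helper names are hypothetical, the talk does not describe how its synthetic data was generated, and a real generator would apply these corruptions at random:

```python
def swap_adjacent(text, i):
    """Typo: swap the characters at positions i and i + 1."""
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def nullify(record, field):
    """Null: blank out one field."""
    return {**record, field: None}


def invert_names(record):
    """Inversion: swap first and last name."""
    return {
        **record,
        "first_name": record["last_name"],
        "last_name": record["first_name"],
    }


def replace_token(record, field, old, new):
    """Replacement: substitute an equivalent token, e.g. 'Suite' with '#'."""
    value = record[field]
    return {**record, field: None if value is None else value.replace(old, new)}


# The fully populated row from the table, used as the clean starting point.
clean = {
    "first_name": "Nicole",
    "last_name": "Fernandez",
    "address": "221 Day Port Suite 262",
    "city": "North Keithfurt",
    "state": "Vermont",
    "zip_code": "73334",
}

print(swap_adjacent("Greenburgh", 0))                                # rGeenburgh
print(invert_names(clean)["first_name"])                             # Fernandez
print(replace_token(clean, "address", "Suite", "#")["address"])      # 221 Day Port # 262
print(nullify(clean, "state")["state"])                              # None
```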
Synthetic Data
§ Naïve Approach: compare all records
§ 300,000 records → 45 billion candidate pairs
§ Blocking Approach:
§ 4.8 million candidate pairs
Production Data
§ Naïve Approach: compare all records
§ 459,460 records → 105 billion candidate pairs
§ Blocking Approach:
§ 4.1 million candidate pairs – (is this good?)
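One common way to quantify how much blocking helped is the reduction ratio, the share of all possible pairs that blocking avoided. The sketch below computes it from the counts on the two slides; the metric is a standard one but is not named in the talk, so treat its use here as an assumption:

```python
from math import comb


def reduction_ratio(n_records: int, n_candidates: int) -> float:
    """Fraction of all possible pairs that blocking removed."""
    return 1 - n_candidates / comb(n_records, 2)


datasets = [
    ("synthetic", 300_000, 4_800_000),
    ("production", 459_460, 4_100_000),
]

for label, n_records, n_candidates in datasets:
    total = comb(n_records, 2)
    ratio = reduction_ratio(n_records, n_candidates)
    print(f"{label}: {total:,} possible pairs, reduction ratio {ratio:.6f}")
```

A high reduction ratio only shows that few comparisons remain. Whether the result is good also depends on how many true matches survive blocking, which requires labeled pairs to measure.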
Thank You
Question & Answer
