Patient Similarity on an Office Laptop
www.vitech.com.ua
The Problem
You have a database of 30M patients with all their medical records.
Each patient is described by 250K binary features.
You need a system for finding the N most similar patients to a
given one.
Jesus, it’s Big Data, get Hadoop!
Extremes
Pre-compute none:
Compare 30 million pairs by 250K features
37+ Tflops
One Intel i7 would compute it in 10 minutes (pure computing time)
Pre-compute all:
450+ trillion pairs
Stored as key-values, more than 1 PB for values only
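The pair count and storage figures for the two extremes can be checked with a few lines of arithmetic. This is a back-of-envelope sketch only: the 4 bytes per stored similarity value is an assumption that does not come from the slides.

```python
# Back-of-envelope check of the two extremes (patient and feature counts from the slides).
N_PATIENTS = 30_000_000
N_FEATURES = 250_000
BYTES_PER_VALUE = 4  # assumption: one 32-bit similarity score per pair

# Pre-compute all: every unordered pair of distinct patients.
n_pairs = N_PATIENTS * (N_PATIENTS - 1) // 2
storage_pb = n_pairs * BYTES_PER_VALUE / 1e15  # decimal petabytes, values only

# Pre-compute none: one query is compared against every patient, feature by feature.
feature_comparisons = N_PATIENTS * N_FEATURES

print(f"pairs: {n_pairs:.3e}")
print(f"values only: {storage_pb:.2f} PB")
print(f"feature comparisons per query: {feature_comparisons:.2e}")
```

Under that assumption the values alone already exceed a petabyte, before any keys are stored.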
Jesus, it’s Big Data, get Hadoop!
Extremes: What to do?
Ideas:
1. We don't need the meaning of each feature; we only care about
the similarity of the patients.
2. We don't want to compare very different patients; we want to
compare only the most similar ones.
Idea 1: Reduce dimensionality
Data representation
                    Patient 1   Patient 2   Patient 3
Dictionary Code 1       1           1           0
Dictionary Code 2       0           1           0
Dictionary Code 3       1           0           1
Idea 1: Reduce dimensionality
Jaccard Similarity as metric
J(X,Y) = |X∩Y| / |X∪Y|
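As a small worked example, Jaccard similarity can be computed directly on sets of dictionary codes. The sketch below treats each patient as the set of codes whose value is 1, using the three patients from the data representation table; the handling of two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |X ∩ Y| / |X ∪ Y| of two sets of feature codes."""
    union = x | y
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(x & y) / len(union)

# Each patient as the set of dictionary codes that are set to 1.
patient_1 = {1, 3}
patient_2 = {1, 2}
patient_3 = {3}

j12 = jaccard(patient_1, patient_2)  # share code 1 out of {1, 2, 3}
j13 = jaccard(patient_1, patient_3)  # share code 3 out of {1, 3}
j23 = jaccard(patient_2, patient_3)  # nothing in common
print(j12, j13, j23)
```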
Idea 1: Reduce dimensionality
Decrease dimensionality of the data while preserving
similarities: LSH with MinHashing
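A minimal sketch of MinHashing, under stated assumptions: the hash family h(x) = (a*x + b) mod p, the prime p, and the signature length of 128 are illustrative choices, not parameters taken from the slides. Each patient's set of active feature indices is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity. The banding step of LSH is not shown.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a prime larger than any feature index used here

def make_hash_params(num_hashes: int, seed: int = 0) -> list:
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(features: set, params: list) -> list:
    """Signature = minimum hash value of a non-empty set under each hash function."""
    return [min((a * x + b) % PRIME for x in features) for a, b in params]

def estimate_jaccard(sig_x: list, sig_y: list) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return matches / len(sig_x)

params = make_hash_params(num_hashes=128)
x = set(range(0, 300))
y = set(range(100, 400))  # true Jaccard similarity = 200 / 400 = 0.5
sig_x = minhash_signature(x, params)
sig_y = minhash_signature(y, params)
print(len(sig_x), estimate_jaccard(sig_x, sig_y))
```

The signature length trades accuracy for size: more hash functions give a tighter estimate but a larger per-patient record.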
Idea 2: Group similar
1. Can’t have ungrouped patients
2. Need to work in minibatches (chunks)
3. Need stochastic guarantees
Size matters.
Idea 2: Group similar
Estimating mean
Hoeffding's inequality
$$p\left(|\hat{\mu} - \mu| \ge \varepsilon_{\max}\right) \le 2\exp\left(-2\,\varepsilon_{\max}^{2}\, m\right)$$
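Hoeffding's inequality for m independent samples bounded in [0, 1] can be turned into a minibatch size: requiring 2*exp(-2*m*eps^2) <= delta gives m >= ln(2/delta) / (2*eps^2). The sketch below does this arithmetic; the tolerance eps = 0.05 and failure probability delta = 0.01 are illustrative values, not figures from the slides.

```python
import math

def hoeffding_bound(m: int, eps: float) -> float:
    """Upper bound on P(|estimate - true mean| >= eps) for m samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)

def min_batch_size(eps: float, delta: float) -> int:
    """Smallest m with 2*exp(-2*m*eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

m = min_batch_size(eps=0.05, delta=0.01)
print(m, hoeffding_bound(m, 0.05))
```

Note that the required size depends only on the tolerance and the failure probability, not on the 30M total, which is what makes working in chunks viable.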
Idea 2: Group similar
Stochastic k-modes
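Joint deviation probability:
$$p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon_{\max},\ |\hat{c}_{kl} - c_{kl}| \ge \varepsilon_{\max}\right) = p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon_{\max}\right)\, p\left(|\hat{c}_{kl} - c_{kl}| \ge \varepsilon_{\max}\right) \le 2\exp\left(-2\,\varepsilon_{\max}^{2}\, m\right) \cdot 2\exp\left(-2\,\varepsilon_{\max}^{2}\, m\right) = 4\exp\left(-4\,\varepsilon_{\max}^{2}\, m\right)$$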
Idea 2: Group similar
Stochastic k-modes
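The slides do not spell out the algorithm, so the following is only one plausible reading of "stochastic k-modes": a mini-batch variant of k-modes on binary vectors with Hamming distance, where each mode is the per-feature majority value of the points assigned so far. The function names, the random initialisation, and the update rule are all assumptions made for illustration.

```python
import random

def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(v, modes):
    """Index of the closest mode; every vector gets a cluster, none stays ungrouped."""
    return min(range(len(modes)), key=lambda i: hamming(v, modes[i]))

def minibatch_kmodes(data, k, batch_size, n_iter, seed=0):
    """Mini-batch k-modes on binary vectors (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    modes = [list(v) for v in rng.sample(data, k)]  # initial modes: k random points
    ones = [[0] * dim for _ in range(k)]            # per-cluster count of 1s per feature
    seen = [0] * k                                  # points assigned to each cluster so far
    for _ in range(n_iter):
        for v in rng.sample(data, batch_size):      # one chunk of the data
            c = assign(v, modes)
            seen[c] += 1
            for j, bit in enumerate(v):
                ones[c][j] += bit
        # New mode: majority value of each feature among the points seen so far.
        for c in range(k):
            if seen[c]:
                modes[c] = [1 if 2 * ones[c][j] > seen[c] else 0 for j in range(dim)]
    return modes

# Toy data: two obvious groups of binary vectors.
data = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
modes = minibatch_kmodes(data, k=2, batch_size=10, n_iter=5)
print(modes, assign(data[0], modes))
```

Because each update only touches running counts, the full data set never has to be in memory at once.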
Idea 2: Group similar
Stochastic k-modes - convergence
[Figure: features changed (y-axis: 0 to 14,000; x-axis: 1 to 97), two series: benchmark features changed, k-modes features changed]
Idea 2: Group similar
Group similar patients and store groups as separate files
Store the centroids of all clusters in a separate file, too
The Solution
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
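The steps above can be sketched as a two-stage lookup. This is a minimal in-memory illustration, not the authors' implementation: `load_cluster` is a hypothetical stand-in for reading one cluster file, the query is assumed to be already reduced to a signature (steps 1 and 2), and similarity is taken as the fraction of matching signature positions.

```python
import heapq

def signature_similarity(a, b):
    """Fraction of equal positions in two signatures (a Jaccard estimate)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def find_similar(query_sig, centroids, load_cluster, n):
    """Return the n (similarity, patient_id) pairs most similar to the query.

    centroids:    list of signatures, one per cluster (the centroid file)
    load_cluster: callable cluster_index -> {patient_id: signature}
                  (hypothetical stand-in for reading one cluster file)
    """
    # Steps 3-4: compare the query with every centroid.
    best = max(range(len(centroids)),
               key=lambda i: signature_similarity(query_sig, centroids[i]))
    # Step 5: load only the cluster of the closest centroid.
    cluster = load_cluster(best)
    # Steps 6-7: rank the patients in that cluster and keep the top n.
    scored = ((signature_similarity(query_sig, sig), pid)
              for pid, sig in cluster.items())
    return heapq.nlargest(n, scored)

# Toy in-memory data in place of the files.
centroids = [[1, 2, 3, 4], [9, 8, 7, 6]]
clusters = {
    0: {"p1": [1, 2, 3, 4], "p2": [1, 2, 9, 9], "p3": [1, 9, 9, 9]},
    1: {"p4": [9, 8, 7, 6]},
}
result = find_similar([1, 2, 3, 0], centroids, clusters.get, n=2)
print(result)
```

Only the centroid list and one cluster are ever held in memory, which is why the memory footprint stays small regardless of the total number of patients.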
The Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB to 1 MB per cluster file
~18 MB centroid file
To do similarity search you need:
~20 GB HDD
~20 MB RAM
Search works in ~100 milliseconds on a
regular office laptop
What’s next?
Other metrics
Purpose-specific metrics
Time introduction
Hierarchical structuring
Cause-effect introduction
What’s next?
Other applications
Care gap detection
Risk/cost management
Diagnosis recommendation by pattern
Intervention recommendation
DataScienceLab2017: Patient similarity: cleaning out duplicates and predicting missing diagnoses, Viktor Sarapin