DataScienceLab2017_Patient similarity: cleaning out duplicates and predicting missing diagnoses_Viktor Sarapin

Patient Similarity on an Office Laptop

www.vitech.com.ua
The Problem
You have a database of 30M patients with all their medical records.
Each patient is described by 250K binary features.
You need a system for finding the N most similar patients to a
given one.
Jesus, it’s Big Data, get Hadoop!

Extremes
Pre-compute none:
Compare 30 million pairs by 250K features
37+ Tflops
One Intel i7 would compute it in 10 minutes (pure computing time)
Pre-compute all:
450+ trillion pairs
Stored as key-values, more than 1 PB for values only
Jesus, it’s Big Data, get Hadoop!
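
The two extremes can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the 4 bytes per stored similarity value is an assumption (the slide does not state a value size), and the constant names are hypothetical. The slide's "37+ Tflops" figure presumably counts several floating-point operations per feature comparison; that factor is not given in the source, so only raw feature comparisons are counted here.

```python
# Rough sizing of the two extremes from the slide.
N_PATIENTS = 30_000_000
N_FEATURES = 250_000
BYTES_PER_VALUE = 4  # assumption: one float32 similarity per pair

# Pre-compute all: roughly n^2 / 2 unordered patient pairs.
pairs_all = N_PATIENTS ** 2 // 2
storage_pb = pairs_all * BYTES_PER_VALUE / 1e15  # petabytes, values only

# Pre-compute none: one query patient against every stored patient,
# feature by feature.
comparisons_per_query = N_PATIENTS * N_FEATURES

print(f"pairs to pre-compute: {pairs_all:,}")
print(f"storage for values:   {storage_pb:.1f} PB")
print(f"feature comparisons:  {comparisons_per_query:,}")
```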

Extremes: What to do?
Ideas:
1. We don’t need the meaning of each feature; we only care about the similarity of the patients.
2. We don’t want to compare very different patients; we want to compare only the most similar ones.

Idea 1: Reduce dimensionality
Data representation
                     Patient 1   Patient 2   Patient 3
Dictionary Code 1        1           1           0

Jaccard Similarity as metric
J(X,Y) = |X∩Y| / |X∪Y|
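
As a small illustration of the metric, each patient can be treated as the set of feature codes that are set to 1. The function name and the convention for two empty sets are assumptions made for this sketch, not something the slides specify.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |X ∩ Y| / |X ∪ Y| of two feature sets."""
    union = x | y
    if not union:
        # Assumption: two patients with no features count as identical.
        return 1.0
    return len(x & y) / len(union)


# Two hypothetical patients, each a set of active feature codes.
patient_a = {1, 2, 3}
patient_b = {2, 3, 4}
similarity = jaccard(patient_a, patient_b)  # 2 shared codes out of 4 total
```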

Decrease dimensionality of the data while preserving
similarities: LSH with MinHashing
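
A minimal MinHash sketch, to show how a short signature can stand in for a 250K-dimensional binary vector while approximately preserving Jaccard similarity. This is not the talk's implementation: the linear hash family, the prime modulus, the number of hash functions, and all function names are assumptions chosen for illustration, and the LSH banding step is not shown.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, larger than any feature code used here


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, params):
    """Signature = minimum hash value over the set, per hash function.

    `features` must be a non-empty collection of integer feature codes.
    """
    return [min((a * f + b) % PRIME for f in features) for a, b in params]


def estimate_jaccard(sig_x, sig_y):
    """Fraction of matching signature positions estimates J(X, Y)."""
    matches = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return matches / len(sig_x)


# Two hypothetical patients sharing 200 of 400 distinct codes (true J = 0.5).
rng = random.Random(42)
codes = rng.sample(range(250_000), 400)
x, y = set(codes[:300]), set(codes[100:])

params = make_hash_params(256)
estimate = estimate_jaccard(minhash_signature(x, params),
                            minhash_signature(y, params))
```

With 256 hash functions each patient is reduced from 250K binary features to 256 integers; the estimate is noisy, and its error shrinks as the signature gets longer.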

Idea 2: Group similar
1. Can’t have ungrouped patients
2. Need to work in minibatches (chunks)
3. Need stochastic guarantees
Size matters.

Estimating mean
Hoeffding's inequality

$$p\left(|\hat{\mu} - \mu| \ge \varepsilon_{\max}\right) \le 2\exp\left(-2 m \varepsilon_{\max}^{2}\right)$$
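
For the mean of m independent samples bounded in [0, 1] (binary features qualify), Hoeffding's inequality gives P(|estimate − true mean| ≥ ε) ≤ 2·exp(−2mε²). Solving for m shows how large a minibatch must be for a given tolerance, which is one reading of "Size matters". The function names and the example tolerance and failure probability are assumptions for this sketch.

```python
import math


def hoeffding_bound(m: int, eps: float) -> float:
    """Upper bound on P(|sample mean - true mean| >= eps) for m samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)


def min_samples(eps: float, delta: float) -> int:
    """Smallest m for which the Hoeffding bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Hypothetical target: error below 0.05 with probability at least 99%.
m_needed = min_samples(eps=0.05, delta=0.01)
```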

Stochastic k-modes
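Joint deviation probability:

$$p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon_{\max},\ |\hat{c}_{i'j} - c_{i'j}| \ge \varepsilon_{\max}\right) = p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon_{\max}\right) \cdot p\left(|\hat{c}_{i'j} - c_{i'j}| \ge \varepsilon_{\max}\right)$$

$$\le 2\exp\left(-2 m \varepsilon_{\max}^{2}\right) \cdot 2\exp\left(-2 m \varepsilon_{\max}^{2}\right) = 4\exp\left(-4 m \varepsilon_{\max}^{2}\right)$$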

Stochastic k-modes

Stochastic k-modes - convergence
[Figure: convergence plot. X-axis: 1 to 97; y-axis: 0 to 14000. Series: "benchmark features changed" and "k-modes features changed".]

Group similar patients and store groups as separate files
Store centroids of each cluster in a separate file, too

The Solution
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
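
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the system from the talk: the patient is assumed to be already reduced to a MinHash signature (steps 1-2), centroids are assumed to be stored as signatures, similarity is taken as the fraction of matching signature positions, and `load_cluster` is a hypothetical callback standing in for reading one cluster file from disk.

```python
import heapq


def signature_similarity(sig_a, sig_b):
    """Fraction of equal positions in two equal-length signatures."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def find_similar(query_sig, centroids, load_cluster, top_n):
    """Return the top_n (similarity, patient_id) pairs for a query signature."""
    # Step 4: compare the patient to every centroid.
    closest = max(range(len(centroids)),
                  key=lambda i: signature_similarity(query_sig, centroids[i]))
    # Step 5: load only the cluster of the closest centroid.
    members = load_cluster(closest)  # list of (patient_id, signature)
    # Steps 6-7: compare within the cluster and keep the best N.
    scored = ((signature_similarity(query_sig, sig), pid) for pid, sig in members)
    return heapq.nlargest(top_n, scored)


# Toy data: two clusters held in memory instead of separate files.
centroids = [[1, 2, 3, 4], [9, 9, 9, 9]]
clusters = {
    0: [("a", [1, 2, 3, 4]), ("b", [1, 2, 0, 0]), ("c", [1, 0, 0, 0])],
    1: [("z", [9, 9, 9, 9])],
}
result = find_similar([1, 2, 3, 4], centroids, clusters.__getitem__, top_n=2)
```

Only the centroid list and one cluster are touched per query, which is what keeps the memory footprint small compared with scanning all patients.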

The Results
50,000 clusters, up to ~1000 patients per cluster
~500 KB-1 MB per cluster file
~18 MB centroid file
To do a similarity search you need:
~20 GB HDD
~20 MB RAM
Search takes ~100 milliseconds on a regular office laptop

What’s next?
Other metrics

Purpose-specific metrics

Time introduction

Hierarchical structuring

Cause-effect introduction

What’s next?

Care gaps detection

Risk/cost management

Diagnosis recommendation by pattern

Intervention recommendation
Other applications
