Patient Similarity on office laptop
www.vitech.com.ua
DataScienceLab2017_Patient similarity: cleaning out duplicates and predicting missing diagnoses_Viktor Sarapin

DataScience Lab, May 13, 2017
Patient similarity: cleaning out duplicates and predicting missing diagnoses
Viktor Sarapin (CEO at V.I.Tech)
How to efficiently detect duplicates across tens of millions of patients, and how to identify missing diagnoses and treatment actions.
All materials are available at: http://datascience.in.ua/report2017

Published in: Technology
  1. Patient Similarity on office laptop
  2. The Problem. You have a database of 30M patients with all their medical records. Each patient is described by 250K binary features. You need a system for finding the N most similar patients to a given one. Jesus, it’s Big Data, get Hadoop!
  3. Extremes
  4. Extremes. Pre-compute none: compare 30 million pairs by 250K features at query time; 37+ Tflops (total operations); one Intel i7 would compute it in 10 minutes (pure computing time). Pre-compute all: 450+ trillion pairs, stored as key-values, more than 1 PB for the values only. Jesus, it’s Big Data, get Hadoop!
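The figures on the Extremes slide can be roughly checked with a few lines of arithmetic. This is only a sketch: the 4 bytes per stored similarity value is an assumption, not something the slides state, and the constant names are made up for illustration.

```python
# Back-of-the-envelope check of the two extremes.
N_PATIENTS = 30_000_000
N_FEATURES = 250_000
BYTES_PER_VALUE = 4  # assumption: one 32-bit similarity score per pair

# Pre-compute all: every unordered pair of distinct patients.
all_pairs = N_PATIENTS * (N_PATIENTS - 1) // 2
storage_pb = all_pairs * BYTES_PER_VALUE / 1e15  # decimal petabytes

# Pre-compute none: one query compared with every patient, feature by feature.
feature_comparisons = N_PATIENTS * N_FEATURES

print(f"pairs to store: {all_pairs:.3e}")
print(f"storage for values only: {storage_pb:.2f} PB")
print(f"feature comparisons per query: {feature_comparisons:.3e}")
```

Under that assumption the pair count comes out at roughly 450 trillion and the value storage at well over one petabyte, which is in line with the slide.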
  5. Extremes: What to do? Ideas: 1. We don’t need the meaning of each feature; we only care about the similarity of the patients. 2. We don’t want to compare very different patients; we want to compare only the most similar ones.
  6. Idea 1: Reduce dimensionality. Data representation:

                           Patient 1   Patient 2   Patient 3
     Dictionary Code 1         1           1           0
     Dictionary Code 2         0           1           0
     Dictionary Code 3         1           0           1
  7. Idea 1: Reduce dimensionality. Jaccard Similarity as metric: J(X, Y) = |X ∩ Y| / |X ∪ Y|
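As a small worked example, the three patients from the data representation slide can be treated as sets of the dictionary codes that are set to 1, and compared with the Jaccard formula. The function and variable names are illustrative, and returning 0.0 for two empty sets is a convention chosen here, not one taken from the slides.

```python
def jaccard(x, y):
    """Jaccard similarity of two sets: |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(x & y) / len(union)

# Each patient is the set of dictionary codes set to 1 in the table.
patient_1 = {1, 3}
patient_2 = {1, 2}
patient_3 = {3}

print(jaccard(patient_1, patient_2))  # one shared code out of three
print(jaccard(patient_1, patient_3))  # one shared code out of two
print(jaccard(patient_2, patient_3))  # no shared codes
```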
  8. Idea 1: Reduce dimensionality. Decrease the dimensionality of the data while preserving similarities: LSH with MinHashing
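A minimal MinHash sketch may help show why the reduction preserves similarity: the fraction of positions where two signatures agree is an estimate of the Jaccard similarity of the original sets. The slides do not say which hash family or signature length was used, so the salted `hashlib.blake2b` hashes, the 128 hash functions, and all function names below are assumptions. The LSH banding step is not shown.

```python
import hashlib

def _hash(feature, index):
    """64-bit hash of an integer feature id under the index-th hash function."""
    digest = hashlib.blake2b(
        feature.to_bytes(8, "little"),
        digest_size=8,
        salt=index.to_bytes(8, "little"),  # a different salt per hash function
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(features, num_hashes=128):
    """For each hash function keep the smallest hash over a non-empty feature set."""
    return [min(_hash(f, i) for f in features) for i in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 300))
b = set(range(100, 400))  # true Jaccard similarity: 200 / 400 = 0.5
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Each patient shrinks from a set of up to 250K possible features to a fixed-length list of integers, and the estimate gets tighter as the number of hash functions grows.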
  9. Idea 2: Group similar. 1. Can’t have ungrouped patients. 2. Need to work in minibatches (chunks). 3. Need stochastic guarantees. Size matters.
  10. Idea 2: Group similar. Estimating mean. Hoeffding's inequality: $p(|\hat{\mu} - \mu| \ge \varepsilon) \le 2\exp(-2 m \varepsilon^{2}) = p_{\max}$
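To make the bound concrete, the sketch below uses the standard two-sided Hoeffding bound for the mean of m samples with values in [0, 1], namely 2·exp(−2·m·ε²), and solves it for the smallest chunk size m that keeps the deviation probability under a target. The tolerance 0.1 and target probability 0.05 are illustrative values, not figures from the talk.

```python
import math

def hoeffding_bound(m, eps):
    """Upper bound on P(|sample mean - true mean| >= eps) for m samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)

def min_samples(eps, p_max):
    """Smallest m for which the Hoeffding bound is at most p_max."""
    return math.ceil(math.log(2.0 / p_max) / (2.0 * eps ** 2))

m = min_samples(eps=0.1, p_max=0.05)
print(m, hoeffding_bound(m, 0.1))
```

Because the required m depends only on ε and the target probability, not on the size of the database, a bound of this kind is one way to justify working on chunks of patients.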
  11. Idea 2: Group similar. Stochastic k-modes. Joint deviation probability, for two centroid estimates $\hat{c}_{ij}$ that both deviate from their true values $c_{ij}$: $p^{D}_{\max} = p_{\max} \cdot p_{\max} = 2\exp(-2 m \varepsilon^{2}) \cdot 2\exp(-2 m \varepsilon^{2}) = 4\exp(-4 m \varepsilon^{2})$
  12. Idea 2: Group similar. Stochastic k-modes
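The slides do not spell out the stochastic k-modes algorithm, so the following is only a generic minibatch k-modes sketch on binary vectors, not the speaker's method. It processes the data in chunks, assigns every row to the nearest mode by Hamming distance (so no row stays ungrouped), and refreshes each mode by majority vote over running counts. The farthest-point initialisation, the tie rule (ties go to 0), and all names are assumptions made for this illustration.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(row, modes):
    """Index of the mode closest to row."""
    return min(range(len(modes)), key=lambda i: hamming(row, modes[i]))

def minibatch_kmodes(rows, k, batch_size, epochs=3, seed=0):
    rng = random.Random(seed)
    # Farthest-point initialisation: start from the first row, then repeatedly
    # add the row that is farthest from all modes chosen so far.
    modes = [list(rows[0])]
    while len(modes) < k:
        far = max(rows, key=lambda r: min(hamming(r, m) for m in modes))
        modes.append(list(far))
    dim = len(rows[0])
    ones = [[0] * dim for _ in range(k)]  # per cluster, per feature: count of 1s seen
    sizes = [0] * k                       # per cluster: number of rows seen
    for _ in range(epochs):
        order = list(range(len(rows)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            for i in order[start:start + batch_size]:
                c = nearest(rows[i], modes)
                sizes[c] += 1
                for j, bit in enumerate(rows[i]):
                    ones[c][j] += bit
            # After each chunk, set every mode feature to the majority value so far.
            for c in range(k):
                if sizes[c]:
                    modes[c] = [1 if 2 * ones[c][j] > sizes[c] else 0
                                for j in range(dim)]
    return modes

proto_a = [1, 1, 1, 1, 0, 0, 0, 0]
proto_b = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [
    proto_a, [0, 1, 1, 1, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0], proto_a,
    proto_b, [0, 0, 0, 0, 1, 1, 1, 0], [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 1], proto_b,
]
modes = minibatch_kmodes(rows, k=2, batch_size=4)
print(modes)
print([nearest(r, modes) for r in rows])
```

On this toy data, where every row is within one bit flip of one of two prototypes, the modes settle on the two prototypes. A production version would also need the stochastic guarantees the slides ask for, which this sketch does not provide.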
  13. Idea 2: Group similar. Stochastic k-modes: convergence. [Figure: features changed, x-axis ticks from 1 to 97, y-axis from 0 to 14000; series: "benchmark features changed" and "k-modes features changed"]
  14. Idea 2: Group similar. Group similar patients and store the groups as separate files. Store the centroids of the clusters in a separate file, too.
  15. The Solution. 1. Load a patient. 2. Reduce dimensionality with minhashing. 3. Load the centroid file. 4. Compare the patient to every centroid. 5. Load the cluster file of the closest centroid. 6. Compare the patient with the patients in the cluster. 7. Show the top N similar.
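The seven steps above can be sketched as a two-stage lookup. To keep the example self-contained, in-memory dictionaries stand in for the centroid file and the cluster files, and exact Jaccard similarity on small sets stands in for the comparison of minhashed signatures. `find_similar`, `load_cluster`, and the toy data are hypothetical names and values, not part of the original system.

```python
import heapq

def jaccard(x, y):
    """Jaccard similarity of two sets, 0.0 when both are empty."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def find_similar(query, centroids, load_cluster, top_n):
    """Return the top_n (similarity, patient_id) pairs for the query."""
    # Steps 3-4: compare the query to every centroid.
    best = max(centroids, key=lambda cid: jaccard(query, centroids[cid]))
    # Step 5: load only the cluster that belongs to the closest centroid.
    members = load_cluster(best)
    # Steps 6-7: score the members of that cluster and keep the best top_n.
    scored = ((jaccard(query, feats), pid) for pid, feats in members.items())
    return heapq.nlargest(top_n, scored)

# Toy stand-ins for the centroid file and the per-cluster files.
centroids = {"c0": {1, 2, 3}, "c1": {7, 8, 9}}
clusters = {
    "c0": {"p1": {1, 2, 3}, "p2": {1, 2}, "p3": {2, 3, 4}},
    "c1": {"p4": {7, 8, 9}, "p5": {8, 9}},
}

result = find_similar({1, 2, 3}, centroids, clusters.__getitem__, top_n=2)
print(result)
```

Only one cluster is ever scanned per query, which is what keeps the work per search small. The trade-off is that a close match sitting in a neighbouring cluster would be missed.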
  16. The Results. 50000 clusters, up to ~1000 patients per cluster; ~500 KB-1 MB per cluster file; ~18 MB centroid file. To do a similarity search you need ~20 GB HDD and ~20 MB RAM. Search works in ~100 milliseconds on a regular office laptop.
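Two of the reported numbers can be cross-checked with simple arithmetic. The sketch assumes that a search keeps only the centroid file and one cluster file in memory at a time, which is how the solution steps read but is not stated outright. The constant names are illustrative.

```python
N_PATIENTS = 30_000_000
N_CLUSTERS = 50_000
CENTROID_FILE_MB = 18      # reported size of the centroid file
MAX_CLUSTER_FILE_MB = 1    # upper end of the reported cluster file size

# Average cluster size if patients were spread evenly over the clusters.
mean_cluster_size = N_PATIENTS / N_CLUSTERS

# Assumption: memory holds the centroid file plus one cluster file.
peak_resident_mb = CENTROID_FILE_MB + MAX_CLUSTER_FILE_MB

print(mean_cluster_size, peak_resident_mb)
```

The average of 600 patients per cluster fits under the reported maximum of about 1000, and about 19 megabytes resident fits the reported RAM requirement of roughly 20.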
  17. What’s next? Other metrics: purpose-specific metrics; time introduction; hierarchical structuring; cause-effect introduction.
  18. What’s next? Other applications: care gaps detection; risk/cost management; diagnosis recommendation by pattern; intervention recommendation.
  19. www.vitech.com.ua