2. The Problem
You have a database of 30M patients with their full medical records. Each patient is
described by 250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus Christ, it’s Big Data, get Hadoop!
3. Jesus Christ, it’s Big Data, get Hadoop!
Pre-compute all:
- 450+ trillion pairs
- stored as key-values, more than 1 PB for the values alone
Pre-compute none:
- compare 30 million pairs by 250K features on every query
- 37+ Tflop of computation
- one Intel i7 would compute it in 10 minutes (pure computing time)
4. Can we do better?
Two main ideas:
- we don’t need the meaning of each feature, we only care about
similarity of the patients;
- we don’t want to compare very different patients, we want to
compare only the most similar ones.
5. Step 1: Reduce dimensionality
Decrease the dimensionality of the data while preserving similarities:
locality-sensitive hashing and minhashing
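A minimal minhash sketch of this step, assuming each patient is given as the set of indices of their 1-features; the hash family, signature length, and all names here are illustrative, not taken from the talk:

```python
import random

NUM_HASHES = 100           # signature length: 250K features -> 100 integers
PRIME = 2_147_483_647      # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
               for _ in range(NUM_HASHES)]

def minhash(feature_ids):
    """Signature of a patient given the set of indices of their 1-features."""
    return [min((a * f + b) % PRIME for f in feature_ids)
            for a, b in HASH_PARAMS]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Two patients sharing most of their features:
p1 = set(range(0, 1000))
p2 = set(range(100, 1100))   # true Jaccard = 900 / 1100 ≈ 0.82
print(estimated_jaccard(minhash(p1), minhash(p2)))
```

Each signature is 100 integers instead of 250K bits, yet comparing two signatures still approximates the true similarity of the two patients.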
7. Step 2: Group similar
Group similar patients and store each group as a separate file
(cluster1.bin … clusterN.bin)
Store the centroids of all clusters in one separate file, too
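The deck does not say how the clusters are built; a minimal sketch of nearest-centroid grouping and the on-disk layout might look like this (the slot-wise signature distance, `pickle` serialization, and file names are all assumptions):

```python
import pickle

def dist(a, b):
    """Slot-wise distance between two minhash signatures."""
    return sum(x != y for x, y in zip(a, b))

def group_by_centroid(signatures, centroids):
    """Assign every patient signature to its nearest centroid."""
    groups = {i: [] for i in range(len(centroids))}
    for sig in signatures:
        best = min(range(len(centroids)),
                   key=lambda i: dist(sig, centroids[i]))
        groups[best].append(sig)
    return groups

def write_cluster_files(groups, centroids):
    """One cluster<i>.bin per group, plus a single centroids file."""
    for i, members in groups.items():
        with open(f"cluster{i + 1}.bin", "wb") as f:
            pickle.dump(members, f)
    with open("centroids.bin", "wb") as f:
        pickle.dump(centroids, f)
```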
8. Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show the top N most similar
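The steps above can be sketched as follows; the file names, `pickle` format, and slot-wise signature distance are illustrative assumptions, and the query signature is assumed to come from steps 1-2 (load the patient, minhash it):

```python
import heapq
import pickle

def dist(a, b):
    """Slot-wise distance between two minhash signatures (assumption:
    fewer differing slots means more similar patients)."""
    return sum(x != y for x, y in zip(a, b))

def find_similar(signature, n):
    """Steps 3-7 for a query signature produced in steps 1-2."""
    with open("centroids.bin", "rb") as f:           # 3. load centroid file
        centroids = pickle.load(f)
    best = min(range(len(centroids)),                # 4. compare to every centroid
               key=lambda i: dist(signature, centroids[i]))
    with open(f"cluster{best + 1}.bin", "rb") as f:  # 5. load closest cluster
        cluster = pickle.load(f)
    return heapq.nsmallest(                          # 6-7. rank, take top N
        n, cluster, key=lambda s: dist(signature, s))
```

Only one small centroid file and one cluster file are touched per query, which is what keeps the search fast.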
9. Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB–1 MB per cluster file
~18 MB centroid file
To run a similarity search you need:
~20 GB of HDD
~20 MB of RAM
Search works in ~100 milliseconds on a regular office laptop