2. The Problem
You have a database of 30M patients with their full medical records. Each patient is
described by 250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus Christ, it’s Big Data, get Hadoop!
3. Jesus Christ, it’s Big Data, get Hadoop!
Pre-compute all:
- 450+ trillion pairs
- stored as key-values, more than 1 PB for the values alone
Pre-compute none:
- compare 30 million pairs by 250K features on every query
- 37+ Tflop of computation
- one Intel i7 would compute it in 10 minutes (pure computing time)
4. Can we do better?
Two main ideas:
- we don’t need the meaning of each feature, we only care about
similarity of the patients;
- we don’t want to compare very different patients, we want to
compare only the most similar ones.
5. Step 1: Reduce dimensionality
Decrease the dimensionality of the data while preserving similarities:
locality-sensitive hashing and minhashing
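A minimal minhash sketch of this step, assuming each patient is given as the set of indices of their 1-features; the hash family, signature length, and all names here are illustrative, not taken from the talk:

```python
import random

NUM_HASHES = 100           # signature length: 250K features -> 100 integers
PRIME = 2_147_483_647      # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
               for _ in range(NUM_HASHES)]

def minhash(feature_ids):
    """Signature of a patient given the set of indices of their 1-features."""
    return [min((a * f + b) % PRIME for f in feature_ids)
            for a, b in HASH_PARAMS]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Two patients sharing most of their features:
p1 = set(range(0, 1000))
p2 = set(range(100, 1100))   # true Jaccard = 900 / 1100 ≈ 0.82
print(estimated_jaccard(minhash(p1), minhash(p2)))
```

Each signature is 100 integers instead of 250K bits, yet comparing two signatures still approximates the true similarity of the two patients.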
7. Step 2: Group similar
Group similar patients and store each group as a separate file
(cluster1.bin … clusterN.bin)
Store the centroids of all clusters in one separate file, too
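The deck does not say how the clusters are built; a minimal sketch of nearest-centroid grouping and the on-disk layout might look like this (the slot-wise signature distance, `pickle` serialization, and file names are all assumptions):

```python
import pickle

def dist(a, b):
    """Slot-wise distance between two minhash signatures."""
    return sum(x != y for x, y in zip(a, b))

def group_by_centroid(signatures, centroids):
    """Assign every patient signature to its nearest centroid."""
    groups = {i: [] for i in range(len(centroids))}
    for sig in signatures:
        best = min(range(len(centroids)),
                   key=lambda i: dist(sig, centroids[i]))
        groups[best].append(sig)
    return groups

def write_cluster_files(groups, centroids):
    """One cluster<i>.bin per group, plus a single centroids file."""
    for i, members in groups.items():
        with open(f"cluster{i + 1}.bin", "wb") as f:
            pickle.dump(members, f)
    with open("centroids.bin", "wb") as f:
        pickle.dump(centroids, f)
```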
8. Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show the top N most similar
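The steps above can be sketched as follows; the file names, `pickle` format, and slot-wise signature distance are illustrative assumptions, and the query signature is assumed to come from steps 1-2 (load the patient, minhash it):

```python
import heapq
import pickle

def dist(a, b):
    """Slot-wise distance between two minhash signatures (assumption:
    fewer differing slots means more similar patients)."""
    return sum(x != y for x, y in zip(a, b))

def find_similar(signature, n):
    """Steps 3-7 for a query signature produced in steps 1-2."""
    with open("centroids.bin", "rb") as f:           # 3. load centroid file
        centroids = pickle.load(f)
    best = min(range(len(centroids)),                # 4. compare to every centroid
               key=lambda i: dist(signature, centroids[i]))
    with open(f"cluster{best + 1}.bin", "rb") as f:  # 5. load closest cluster
        cluster = pickle.load(f)
    return heapq.nsmallest(                          # 6-7. rank, take top N
        n, cluster, key=lambda s: dist(signature, s))
```

Only one small centroid file and one cluster file are touched per query, which is what keeps the search fast.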
9. Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB–1 MB per cluster file
~18 MB centroid file
To run a similarity search you need:
~20 GB of HDD
~20 MB of RAM
Search works in ~100 milliseconds on a regular office laptop