Efficient
Similarity Search
on Big Data
with an office laptop
Sergii Shelpuk
Head of Data Science, V.I.Tech
The Problem
You have a database of 30M patients with all their medical records. Each patient is described by
250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus Christ, it’s Big Data, get Hadoop!
Jesus Christ, it’s Big Data, get Hadoop!
Pre-compute none
- Compare 30 million pairs by 250K features
- 37+ trillion floating-point operations
- One Intel i7 would compute it in about 10 minutes (pure computing time)
Pre-compute all
- 450+ trillion pairs
- Stored as key-value pairs, more than 1 PB for the values alone
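A back-of-envelope check of these numbers, as a rough sketch. The 4-byte value size and the 60 GFLOPS sustained throughput are assumptions made here for illustration; the slide states neither. The slide also does not say how it counts operations per feature comparison, so the 37 trillion figure is taken as given.

```python
# Back-of-envelope check of the brute-force numbers.
N_PATIENTS = 30_000_000
N_FEATURES = 250_000

# Pre-compute all: every unordered pair of patients.
pairs = N_PATIENTS * (N_PATIENTS - 1) // 2

# ASSUMPTION: one 4-byte similarity value per pair (keys not counted).
BYTES_PER_VALUE = 4
storage_pb = pairs * BYTES_PER_VALUE / 1e15

# Pre-compute none: one query is compared with every patient, feature by feature.
feature_comparisons = N_PATIENTS * N_FEATURES

# ASSUMPTION: about 60 GFLOPS sustained on one CPU.
# The operation count itself is the slide's figure, not derived here.
SLIDE_OPERATIONS = 37e12
ASSUMED_FLOPS = 60e9
minutes = SLIDE_OPERATIONS / ASSUMED_FLOPS / 60

print(f"pairs to pre-compute:    {pairs:,}")
print(f"storage for values:      {storage_pb:.2f} PB")
print(f"feature comparisons:     {feature_comparisons:,}")
print(f"time for one query:      {minutes:.1f} minutes")
```

Under these assumptions, both options come out in the range the slide quotes: petabytes of storage for "pre-compute all", minutes per query for "pre-compute none".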
Can we do better?
Two main ideas:
- we don’t need the meaning of each feature, we only care about
similarity of the patients;
- we don’t want to compare very different patients, we want to
compare only the most similar ones.
Step 1: Reduce dimensionality
Decrease dimensionality of the data while preserving similarities
Locality-sensitive hashing and minhashing
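A minimal minhashing sketch, not the talk's actual implementation. It assumes a patient is represented as the set of indices of features equal to 1, and that similarity means Jaccard similarity of those sets. The function names, the 128-hash signature length, and the linear hash family are illustrative choices.

```python
import random

# Mersenne prime, far larger than any feature index (0 .. 249,999).
PRIME = (1 << 61) - 1

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(active_features, params):
    """Signature of a non-empty set of active feature indices."""
    feats = list(active_features)
    return [min((a * x + b) % PRIME for x in feats) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(1 for u, v in zip(sig_a, sig_b) if u == v) / len(sig_a)

def exact_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two toy patients sharing 500 of 1,500 active features.
params = make_hash_params(num_hashes=128)
patient_a = set(range(0, 1000))
patient_b = set(range(500, 1500))
sig_a = minhash_signature(patient_a, params)
sig_b = minhash_signature(patient_b, params)

print("exact Jaccard:    ", exact_jaccard(patient_a, patient_b))
print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
```

Each patient shrinks from 250K binary features to a fixed-length list of integers, and the fraction of matching positions approximates the original set similarity. More hash functions give a tighter estimate at the cost of a longer signature.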
K-Means clustering
K-Means clustering puts similar patients into the same group
Step 2: Group similar
Group similar patients and store each group as a separate file
Store the centroids of all clusters in a separate file, too
cluster1.bin
clusterN.bin
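One possible way to write and read such a cluster file, as a sketch only. The slides do not describe the binary layout, so the format here (a count and dimension header followed by fixed-size records of patient id plus reduced vector, all unsigned 32-bit integers) is hypothetical, as are the function names.

```python
import struct
import tempfile
from pathlib import Path

def write_cluster(path, members, dim):
    """members: list of (patient_id, vector), vector = dim unsigned ints."""
    record = struct.Struct(f"<I{dim}I")
    with open(path, "wb") as f:
        # Header: number of patients and vector length.
        f.write(struct.pack("<II", len(members), dim))
        for patient_id, vector in members:
            f.write(record.pack(patient_id, *vector))

def read_cluster(path):
    """Return the list of (patient_id, vector) stored in one cluster file."""
    with open(path, "rb") as f:
        count, dim = struct.unpack("<II", f.read(8))
        record = struct.Struct(f"<I{dim}I")
        members = []
        for _ in range(count):
            patient_id, *vector = record.unpack(f.read(record.size))
            members.append((patient_id, vector))
        return members

# Round trip through a temporary directory.
original = [(1, [5, 6, 7]), (2, [8, 9, 10])]
with tempfile.TemporaryDirectory() as tmp:
    cluster_path = Path(tmp) / "cluster1.bin"
    write_cluster(cluster_path, original, dim=3)
    restored = read_cluster(cluster_path)

print(restored)
```

Fixed-size records keep each file small and let a whole cluster be read in one sequential pass.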
Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
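Steps 4 to 7 can be sketched as follows. This is an illustration under assumptions, not the talk's code: the query is already reduced to a short vector, distance is Euclidean (the slides do not name the measure), and `load_cluster` is a hypothetical callable standing in for reading one cluster file.

```python
import heapq
import math

def distance(u, v):
    # ASSUMPTION: Euclidean distance on the reduced vectors (Python 3.8+).
    return math.dist(u, v)

def find_similar(query, centroids, load_cluster, top_n):
    """Return the top_n (patient_id, vector) pairs closest to the query."""
    # Step 4: compare the patient to every centroid.
    closest = min(range(len(centroids)),
                  key=lambda i: distance(query, centroids[i]))
    # Step 5: load only the cluster of the closest centroid.
    members = load_cluster(closest)
    # Steps 6-7: rank the patients in that cluster and keep the top N.
    return heapq.nsmallest(top_n, members,
                           key=lambda m: distance(query, m[1]))

# Toy data: two clusters held in memory instead of files.
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = {
    0: [("p1", (0.0, 1.0)), ("p2", (1.0, 0.0)), ("p3", (2.0, 2.0))],
    1: [("p4", (9.0, 10.0)), ("p5", (11.0, 11.0))],
}

result = find_similar((0.5, 0.0), centroids, clusters.__getitem__, top_n=2)
print([patient_id for patient_id, _ in result])  # ['p2', 'p1']
```

Only one cluster is ever searched, which is what keeps the query cheap. The trade-off is that a close patient sitting just across a cluster boundary can be missed, so the result is approximate.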
Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB-1 MB per cluster file
~18 MB centroid file
To do a similarity search you need:
~20 GB HDD
~20 MB RAM
Search takes ~100 milliseconds on a regular
office laptop
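A quick consistency check of these figures, as a sketch. It assumes decimal units (1 MB = 10^6 bytes) and that the memory needed per query is the centroid file plus one cluster file; the slides do not state either.

```python
N_PATIENTS = 30_000_000
N_CLUSTERS = 50_000
CENTROID_FILE_BYTES = 18e6     # ~18 MB centroid file
MAX_CLUSTER_FILE_BYTES = 1e6   # cluster files are ~500 KB to 1 MB
DISK_BYTES = 20e9              # ~20 GB on disk

# Average cluster size implied by the patient and cluster counts.
mean_cluster_size = N_PATIENTS / N_CLUSTERS

# Storage implied per centroid and per patient.
bytes_per_centroid = CENTROID_FILE_BYTES / N_CLUSTERS
bytes_per_patient = DISK_BYTES / N_PATIENTS

# ASSUMPTION: a query holds the centroid file plus one cluster file in memory.
peak_ram_bytes = CENTROID_FILE_BYTES + MAX_CLUSTER_FILE_BYTES

print(f"mean patients per cluster: {mean_cluster_size:.0f}")
print(f"bytes per centroid:        {bytes_per_centroid:.0f}")
print(f"bytes per patient on disk: {bytes_per_patient:.0f}")
print(f"peak RAM per query:        {peak_ram_bytes / 1e6:.0f} MB")
```

Under these assumptions the numbers hang together: about 600 patients per cluster on average, a few hundred bytes per stored patient, and roughly 19 MB held in memory per query.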
Thank you

Object similarity with office laptop
