SlideShare a Scribd company logo
Finding Similar Items in high-
dimensional spaces: Locality
Sensitive Hashing
Dmitriy Selivanov
04 Sep 2015
Data Science
Statistics
Domain knowledge
Computer science
·
·
·
/
Today's problem: near duplicates detection
Find all pairs of data points within distance threshold
Given: High dimensional data points
And some distance function
· ( , , . . . )x1 x2
Image
User-Item (rating, whatever) matrix
-
-
· d( , )x1 x2
Euclidean distance
Cosine distance
Jaccard distance
-
-
-
J(A, B) =
|A ∩ B|
|A ∪ B|
,xi xj d( , ) < sxi xj
/
Near-neighbor search
Applications
High Dimensions
High-dimensional spaces are lonely places
Duplicate detection (plagiarism, entity resolution)
Recommender systems
Clustering
·
·
·
Bag-of-words model
Large-Scale Recommender Systems (many users vs many items)
·
·
Curse of dimensionality·
almost all pairs of points are equally far away from one another-
/
Finding similar text documents
Examples
Challenges
Mirror websites
Similar news articles (google news, yandex news?)
·
·
Pieces of one document can appear out of order (headers, footers, etc), diffrent lengths.
Large collection of documents can not fit in RAM
Too many documents to compare all pairs - complexity
·
·
· O( )n
2
/
Pipeline
Pieces of one document can appear out of order (headers, footers, etc) =>
Shingling
Documents as sets
Large collection of documents can not fit in RAM =>
Minhashing
Convert large sets to short signatures, while preserving similarity
Too many documents to compare all pairs - complexity =>
Locality Sensitive Hashing
Focus on pairs of signatures likely to be from similar documents.
·
·
·
·
· O( )n
2
·
/
Document representation
Example phrase:
"To be, or not to be, that is the question"
Documents as sets
Bag-of-words => set of words:
k-shingles or k-gram => unordered set of k-grams:
·
{"to", "be", "or", "not", "that", "is", "the", "question"}
don't work well - need ordering
-
-
·
word level (for k = 2)
caharacter level (for k = 3):
-
{"to be", "be or", "or not", "not to", "be that", "that is", "is the", "the
question"}
-
-
{"to ", "o b", " be", "be ", "e o", " or", "or ", "r n", " no", "not", "ot
", "t t", " to", "e t", " th", "tha", "hat", "at ", "t i", " is", "is ", "s
t", "the", "he ", "e q", " qu", "que", "ues", "est", "sti", "tio", "ion"}
-
/
Practical notes
Optionally can compress (hash!) long shingles into 4 byte integers!
Picking
should be picked large enough that the probability of any given shingle appearing in any given
document is low
k
is OK for short documents
is OK for long documents
· k = 5
· k = 10
k
/
Binary term-document-matrix
shingle s1 s2 intersecton union
……. . . . .
осенняя 0 0 - -
светило 1 0 - +
летнее 1 1 + +
солце 1 1 + +
яркое 0 1 - +
погода 0 0 - -
……. . . . .
D1 = "светило летнее солце" => s1 = {"светило", "летнее", "солце"}
D2 = "яркое летнее солнце" => s2 = {"яркое", "летнее", "солце"}
·
·
/
Type of rows
type s1 s2
a 1 1
b 1 0
c 0 1
d 0 0
A, B, C, D - # rows types a, b, c, d
J( , ) =s1 s2
A
(A + B + C)
/
Minhashing
Convert large sets to short signatures
Minhashing example
1. Random permutation of rows of the input-matrix .
2. Minhash function = # of first row in which column .
3. Use independent permutations we will end with minhash functions. => can construct signature-
matrix from input-matrix using these minhash functions.
M
h(c) c == 1
N N
/
Property
(h( ) = h( )) =???Pperm s1 s2
=
Why? Both
remember this result
·
J( , )s1 s2
· A
(A+B+C)
·
/
Implementation of Minhashing
One-pass implementation
Universal hashing
random permutation
random lookup
·
·
1. Instead of permutations pick independent hash-functions ( )
2. For column and hash-function keep slot . Init with +Inf.
3. Scan rows looking for 1
N N N = O(1/ )ϵ2
c hi Sig(i, c)
Suppose row has 1 in column
Then for each : If =>
· r c
· ki (r) < Sig(i, c)hi Sig(i, c) := (r))hi
(x) = ((ax + b) mod p)hi
- integers
- large prime:
· a, b
· p p > N
/
Algorithm
for each row r do begin
for each hash function hi do
compute hi (r);
for each column c
if c has 1 in row r
for each hash function hi do
if hi(r) is smaller than M(i, c) then
M(i, c) := hi(r);
end;
/
Candidates
Still comlexity
Bruteforce
O( )n
2
library('microbenchmark')
library('magrittr')
jaccard <- function(x, y) {
set_intersection <- intersect(x, y) %>% length
set_union <- length(x) + length(y) - set_intersection
return(set_intersection / set_union)
}
elements <- sapply(seq_len(1e5), function(x) paste(sample(letters, 4), collapse = '')) %>% unique
set_1 <- sample(elements, 100, replace = F)
set_2 <- sample(elements, 100, replace = F)
microbenchmark(jaccard(set_1, set_2))
## Unit: microseconds
## expr min lq mean median uq max neval
## jaccard(set_1, set_2) 56.812 58.693 65.64373 61.8075 69.184 161.986 100
/
Job for LSH:
1. Divide M into bands, rows each
2. Hash each column in band into table with large number of buckets
3. Column become candidate if fall into same bucket for any band
b r
bi
/
Bands number tuning
1. The probability that the signatures agree in all rows of one particular band is
2. The probability that the signatures disagree in at least one row of a particular band is
3. The probability that the signatures disagree in at least one row of each of the bands is
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a
candidate pair, is
s
r
1 − s
r
(1 − s
r
)
b
1 − (1 − s
r
)
b
/
Example
Goal : tune and to catch most similar pairs, but few non-similar pairs
100k documents =>
100 hash-functions => signatures of 100 integers = 40mb
find all documents, similar at least
·
·
· b = 20, r = 5
· s = 0.8
1. Probability C1, C2 identical in one particular band:
2. Probability C1, C2 are not similar in all of the 20 bands:
(0.8 = 0.328)
5
(1 − 0.328 = 0.00035)
20
b r
/
LSH function families
Minhash - jaccard similiarity
Random projections (random hypeplanes) - cosine similiarity
p-stable-distributions - Euclidean distance (and norm)
Edit distance
Hamming distance
·
·
· Lp
·
·
/
References
https://www.coursera.org/course/mmds
R package LSHR - https://github.com/dselivanov/LSHR
Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. http://mmds.org/
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of
dimensionality.
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji: Hashing for Similarity Search: A Survey
Alexandr Andoni and Piotr Indyk : Near-Optimal Hashing Algorithms for Approximate Nearest
Neighbor in High Dimensions
Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme
Based on p-Stable Distributions
·
·
·
·
·
/

More Related Content

What's hot

Knights tour on chessboard using backtracking
Knights tour on chessboard using backtrackingKnights tour on chessboard using backtracking
Knights tour on chessboard using backtracking
Abhishek Singh
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
oneous
 
2.5 graph dfs
2.5 graph dfs2.5 graph dfs
2.5 graph dfs
Krish_ver2
 
Depth-First Search
Depth-First SearchDepth-First Search
Depth-First Search
Md. Shafiuzzaman Hira
 
Stack
StackStack
DFS and BFS
DFS and BFSDFS and BFS
DFS and BFS
satya parsana
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
Sangmin Woo
 
Bfs and dfs in data structure
Bfs and dfs in  data structure Bfs and dfs in  data structure
Bfs and dfs in data structure
Ankit Kumar Singh
 
Pda to cfg h2
Pda to cfg h2Pda to cfg h2
Pda to cfg h2
Rajendran
 
Pr083 Non-local Neural Networks
Pr083 Non-local Neural NetworksPr083 Non-local Neural Networks
Pr083 Non-local Neural Networks
Taeoh Kim
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
Krish_ver2
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
Krishma Parekh
 
LL(1) parsing
LL(1) parsingLL(1) parsing
LL(1) parsing
KHYATI PATEL
 
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
Umesh Kumar
 
Breadth first search and depth first search
Breadth first search and  depth first searchBreadth first search and  depth first search
Breadth first search and depth first search
Hossain Md Shakhawat
 
Computer Graphic - Transformations in 2D
Computer Graphic - Transformations in 2DComputer Graphic - Transformations in 2D
Computer Graphic - Transformations in 2D
2013901097
 
Finite automata examples
Finite automata examplesFinite automata examples
Finite automata examples
ankitamakin
 
Analysis and Design of Algorithms notes
Analysis and Design of Algorithms  notesAnalysis and Design of Algorithms  notes
Analysis and Design of Algorithms notes
Prof. Dr. K. Adisesha
 
Sorting Techniques
Sorting TechniquesSorting Techniques
Sorting TechniquesRafay Farooq
 

What's hot (20)

Knights tour on chessboard using backtracking
Knights tour on chessboard using backtrackingKnights tour on chessboard using backtracking
Knights tour on chessboard using backtracking
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
 
2.5 graph dfs
2.5 graph dfs2.5 graph dfs
2.5 graph dfs
 
Depth-First Search
Depth-First SearchDepth-First Search
Depth-First Search
 
Stack
StackStack
Stack
 
DFS and BFS
DFS and BFSDFS and BFS
DFS and BFS
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
 
Bfs and dfs in data structure
Bfs and dfs in  data structure Bfs and dfs in  data structure
Bfs and dfs in data structure
 
Pda to cfg h2
Pda to cfg h2Pda to cfg h2
Pda to cfg h2
 
Pr083 Non-local Neural Networks
Pr083 Non-local Neural NetworksPr083 Non-local Neural Networks
Pr083 Non-local Neural Networks
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
Merge sort
Merge sortMerge sort
Merge sort
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
 
LL(1) parsing
LL(1) parsingLL(1) parsing
LL(1) parsing
 
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...
 
Breadth first search and depth first search
Breadth first search and  depth first searchBreadth first search and  depth first search
Breadth first search and depth first search
 
Computer Graphic - Transformations in 2D
Computer Graphic - Transformations in 2DComputer Graphic - Transformations in 2D
Computer Graphic - Transformations in 2D
 
Finite automata examples
Finite automata examplesFinite automata examples
Finite automata examples
 
Analysis and Design of Algorithms notes
Analysis and Design of Algorithms  notesAnalysis and Design of Algorithms  notes
Analysis and Design of Algorithms notes
 
Sorting Techniques
Sorting TechniquesSorting Techniques
Sorting Techniques
 

Similar to Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: Locality Sensitive Hashing

Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Paris Women in Machine Learning and Data Science
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
source{d}
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
ssuserf11a32
 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via Hashing
Maruf Aytekin
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat Sheet
Muthu Vinayagam
 
Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++Vikash Dhal
 
lecture5.ppt
lecture5.pptlecture5.ppt
lecture5.ppt
KarthiKeyan462713
 
The string class
The string classThe string class
The string class
Syed Zaid Irshad
 
Hashing in datastructure
Hashing in datastructureHashing in datastructure
Hashing in datastructure
rajshreemuthiah
 
Lec5
Lec5Lec5
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptx
jainaaru59
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
Rebekah Mercer
 
1.Buffer Overflows
1.Buffer Overflows1.Buffer Overflows
1.Buffer Overflows
phanleson
 
Algorithms Exam Help
Algorithms Exam HelpAlgorithms Exam Help
Algorithms Exam Help
Programming Exam Help
 
R language introduction
R language introductionR language introduction
R language introduction
Shashwat Shriparv
 
Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10
FredrikRonquist
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
JavaDayUA
 

Similar to Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: Locality Sensitive Hashing (20)

Nn
NnNn
Nn
 
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Hashing
HashingHashing
Hashing
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via Hashing
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat Sheet
 
I1
I1I1
I1
 
Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++
 
lecture5.ppt
lecture5.pptlecture5.ppt
lecture5.ppt
 
The string class
The string classThe string class
The string class
 
Hashing in datastructure
Hashing in datastructureHashing in datastructure
Hashing in datastructure
 
Lec5
Lec5Lec5
Lec5
 
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptx
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
 
1.Buffer Overflows
1.Buffer Overflows1.Buffer Overflows
1.Buffer Overflows
 
Algorithms Exam Help
Algorithms Exam HelpAlgorithms Exam Help
Algorithms Exam Help
 
R language introduction
R language introductionR language introduction
R language introduction
 
Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
 

More from Mail.ru Group

Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
Mail.ru Group
 
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
Mail.ru Group
 
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир ДубровинДругая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
Mail.ru Group
 
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
Mail.ru Group
 
Управление инцидентами в Почте Mail.ru, Антон Викторов
Управление инцидентами в Почте Mail.ru, Антон ВикторовУправление инцидентами в Почте Mail.ru, Антон Викторов
Управление инцидентами в Почте Mail.ru, Антон Викторов
Mail.ru Group
 
DAST в CI/CD, Ольга Свиридова
DAST в CI/CD, Ольга СвиридоваDAST в CI/CD, Ольга Свиридова
DAST в CI/CD, Ольга Свиридова
Mail.ru Group
 
Почему вам стоит использовать свой велосипед и почему не стоит Александр Бел...
Почему вам стоит использовать свой велосипед и почему не стоит  Александр Бел...Почему вам стоит использовать свой велосипед и почему не стоит  Александр Бел...
Почему вам стоит использовать свой велосипед и почему не стоит Александр Бел...
Mail.ru Group
 
CV в пайплайне распознавания ценников товаров: трюки и хитрости Николай Масл...
CV в пайплайне распознавания ценников товаров: трюки и хитрости  Николай Масл...CV в пайплайне распознавания ценников товаров: трюки и хитрости  Николай Масл...
CV в пайплайне распознавания ценников товаров: трюки и хитрости Николай Масл...
Mail.ru Group
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
Mail.ru Group
 
WebAuthn в реальной жизни, Анатолий Остапенко
WebAuthn в реальной жизни, Анатолий ОстапенкоWebAuthn в реальной жизни, Анатолий Остапенко
WebAuthn в реальной жизни, Анатолий Остапенко
Mail.ru Group
 
AMP для электронной почты, Сергей Пешков
AMP для электронной почты, Сергей ПешковAMP для электронной почты, Сергей Пешков
AMP для электронной почты, Сергей Пешков
Mail.ru Group
 
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила СтрелковКак мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
Mail.ru Group
 
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
Mail.ru Group
 
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.ТаксиМетапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
Mail.ru Group
 
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru GroupКак не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
Mail.ru Group
 
Этика искусственного интеллекта, Александр Кармаев (AI Journey)
Этика искусственного интеллекта, Александр Кармаев (AI Journey)Этика искусственного интеллекта, Александр Кармаев (AI Journey)
Этика искусственного интеллекта, Александр Кармаев (AI Journey)
Mail.ru Group
 
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
Mail.ru Group
 
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
Mail.ru Group
 
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
Mail.ru Group
 
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
Mail.ru Group
 

More from Mail.ru Group (20)

Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...
 
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...
 
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир ДубровинДругая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
Другая сторона баг-баунти-программ: как это выглядит изнутри, Владимир Дубровин
 
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...
 
Управление инцидентами в Почте Mail.ru, Антон Викторов
Управление инцидентами в Почте Mail.ru, Антон ВикторовУправление инцидентами в Почте Mail.ru, Антон Викторов
Управление инцидентами в Почте Mail.ru, Антон Викторов
 
DAST в CI/CD, Ольга Свиридова
DAST в CI/CD, Ольга СвиридоваDAST в CI/CD, Ольга Свиридова
DAST в CI/CD, Ольга Свиридова
 
Почему вам стоит использовать свой велосипед и почему не стоит Александр Бел...
Почему вам стоит использовать свой велосипед и почему не стоит  Александр Бел...Почему вам стоит использовать свой велосипед и почему не стоит  Александр Бел...
Почему вам стоит использовать свой велосипед и почему не стоит Александр Бел...
 
CV в пайплайне распознавания ценников товаров: трюки и хитрости Николай Масл...
CV в пайплайне распознавания ценников товаров: трюки и хитрости  Николай Масл...CV в пайплайне распознавания ценников товаров: трюки и хитрости  Николай Масл...
CV в пайплайне распознавания ценников товаров: трюки и хитрости Николай Масл...
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
WebAuthn в реальной жизни, Анатолий Остапенко
WebAuthn в реальной жизни, Анатолий ОстапенкоWebAuthn в реальной жизни, Анатолий Остапенко
WebAuthn в реальной жизни, Анатолий Остапенко
 
AMP для электронной почты, Сергей Пешков
AMP для электронной почты, Сергей ПешковAMP для электронной почты, Сергей Пешков
AMP для электронной почты, Сергей Пешков
 
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила СтрелковКак мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
Как мы захотели TWA и сделали его без мобильных разработчиков, Данила Стрелков
 
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...
 
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.ТаксиМетапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
Метапрограммирование: строим конечный автомат, Сергей Федоров, Яндекс.Такси
 
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru GroupКак не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
Как не сделать врагами архитектуру и оптимизацию, Кирилл Березин, Mail.ru Group
 
Этика искусственного интеллекта, Александр Кармаев (AI Journey)
Этика искусственного интеллекта, Александр Кармаев (AI Journey)Этика искусственного интеллекта, Александр Кармаев (AI Journey)
Этика искусственного интеллекта, Александр Кармаев (AI Journey)
 
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...
 
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...
 
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)
 
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()
 

Recently uploaded

platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
Health Advances
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
IvanMallco1
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
AlguinaldoKong
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
aishnasrivastava
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 

Recently uploaded (20)

platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 

Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: Locality Sensitive Hashing

  • 1. Finding Similar Items in high- dimensional spaces: Locality Sensitive Hashing Dmitriy Selivanov 04 Sep 2015
  • 3. Today's problem: near duplicates detection Find all pairs of data points within distance threshold Given: High dimensional data points And some distance function · ( , , . . . )x1 x2 Image User-Item (rating, whatever) matrix - - · d( , )x1 x2 Euclidean distance Cosine distance Jaccard distance - - - J(A, B) = |A ∩ B| |A ∪ B| ,xi xj d( , ) < sxi xj /
  • 4. Near-neighbor search Applications High Dimensions High-dimensional spaces are lonely places Duplicate detection (plagiarism, entity resolution) Recommender systems Clustering · · · Bag-of-words model Large-Scale Recommender Systems (many users vs many items) · · Curse of dimensionality· almost all pairs of points are equally far away from one another- /
  • 5. Finding similar text documents Examples Challenges Mirror websites Similar news articles (google news, yandex news?) · · Pieces of one document can appear out of order (headers, footers, etc), diffrent lengths. Large collection of documents can not fit in RAM Too many documents to compare all pairs - complexity · · · O( )n 2 /
  • 6. Pipeline Pieces of one document can appear out of order (headers, footers, etc) => Shingling Documents as sets Large collection of documents can not fit in RAM => Minhashing Convert large sets to short signatures, while preserving similarity Too many documents to compare all pairs - complexity => Locality Sensitive Hashing Focus on pairs of signatures likely to be from similar documents. · · · · · O( )n 2 · /
  • 7. Document representation Example phrase: "To be, or not to be, that is the question" Documents as sets Bag-of-words => set of words: k-shingles or k-gram => unordered set of k-grams: · {"to", "be", "or", "not", "that", "is", "the", "question"} don't work well - need ordering - - · word level (for k = 2) caharacter level (for k = 3): - {"to be", "be or", "or not", "not to", "be that", "that is", "is the", "the question"} - - {"to ", "o b", " be", "be ", "e o", " or", "or ", "r n", " no", "not", "ot ", "t t", " to", "e t", " th", "tha", "hat", "at ", "t i", " is", "is ", "s t", "the", "he ", "e q", " qu", "que", "ues", "est", "sti", "tio", "ion"} - /
  • 8. Practical notes Optionally can compress (hash!) long shingles into 4 byte integers! Picking should be picked large enough that the probability of any given shingle appearing in any given document is low k is OK for short documents is OK for long documents · k = 5 · k = 10 k /
  • 9. Binary term-document-matrix shingle s1 s2 intersecton union ……. . . . . осенняя 0 0 - - светило 1 0 - + летнее 1 1 + + солце 1 1 + + яркое 0 1 - + погода 0 0 - - ……. . . . . D1 = "светило летнее солце" => s1 = {"светило", "летнее", "солце"} D2 = "яркое летнее солнце" => s2 = {"яркое", "летнее", "солце"} · · /
  • 10. Type of rows type s1 s2 a 1 1 b 1 0 c 0 1 d 0 0 A, B, C, D - # rows types a, b, c, d J( , ) =s1 s2 A (A + B + C) /
  • 11. Minhashing Convert large sets to short signatures Minhashing example 1. Random permutation of rows of the input-matrix . 2. Minhash function = # of first row in which column . 3. Use independent permutations we will end with minhash functions. => can construct signature- matrix from input-matrix using these minhash functions. M h(c) c == 1 N N /
  • 12. Property (h( ) = h( )) =???Pperm s1 s2 = Why? Both remember this result · J( , )s1 s2 · A (A+B+C) · /
  • 13. Implementation of Minhashing One-pass implementation Universal hashing random permutation random lookup · · 1. Instead of permutations pick independent hash-functions ( ) 2. For column and hash-function keep slot . Init with +Inf. 3. Scan rows looking for 1 N N N = O(1/ )ϵ2 c hi Sig(i, c) Suppose row has 1 in column Then for each : If => · r c · ki (r) < Sig(i, c)hi Sig(i, c) := (r))hi (x) = ((ax + b) mod p)hi - integers - large prime: · a, b · p p > N /
  • 14. Algorithm for each row r do begin for each hash function hi do compute hi (r); for each column c if c has 1 in row r for each hash function hi do if hi(r) is smaller than M(i, c) then M(i, c) := hi(r); end; /
  • 15. Candidates Still comlexity Bruteforce O( )n 2 library('microbenchmark') library('magrittr') jaccard <- function(x, y) { set_intersection <- intersect(x, y) %>% length set_union <- length(x) + length(y) - set_intersection return(set_intersection / set_union) } elements <- sapply(seq_len(1e5), function(x) paste(sample(letters, 4), collapse = '')) %>% unique set_1 <- sample(elements, 100, replace = F) set_2 <- sample(elements, 100, replace = F) microbenchmark(jaccard(set_1, set_2)) ## Unit: microseconds ## expr min lq mean median uq max neval ## jaccard(set_1, set_2) 56.812 58.693 65.64373 61.8075 69.184 161.986 100 /
  • 16. Job for LSH: 1. Divide M into bands, rows each 2. Hash each column in band into table with large number of buckets 3. Column become candidate if fall into same bucket for any band b r bi /
  • 17. Bands number tuning 1. The probability that the signatures agree in all rows of one particular band is 2. The probability that the signatures disagree in at least one row of a particular band is 3. The probability that the signatures disagree in at least one row of each of the bands is 4. The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is s r 1 − s r (1 − s r ) b 1 − (1 − s r ) b /
  • 18. Example Goal : tune and to catch most similar pairs, but few non-similar pairs 100k documents => 100 hash-functions => signatures of 100 integers = 40mb find all documents, similar at least · · · b = 20, r = 5 · s = 0.8 1. Probability C1, C2 identical in one particular band: 2. Probability C1, C2 are not similar in all of the 20 bands: (0.8 = 0.328) 5 (1 − 0.328 = 0.00035) 20 b r /
  • 19. LSH function families Minhash - jaccard similiarity Random projections (random hypeplanes) - cosine similiarity p-stable-distributions - Euclidean distance (and norm) Edit distance Hamming distance · · · Lp · · /
  • 20. References https://www.coursera.org/course/mmds R package LSHR - https://github.com/dselivanov/LSHR Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. http://mmds.org/ P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji: Hashing for Similarity Search: A Survey Alexandr Andoni and Piotr Indyk : Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme Based on p-Stable Distributions · · · · · /