Dsh data sensitive hashing for high dimensional k-nn search

DSH: Data Sensitive Hashing for High-Dimensional k-NN Search
Choi¹, Myung², Lim¹, Song²
¹DataKnow. Lab., ²D&C Lab., Korea Univ.
Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi
SIGMOD `14

3/ 12
App: Large Scale Image Search in Database
• Find similar images in a large database (e.g., Google Image Search)
Kristen Grauman et al
slide: Yunchao Gong UNC Chapel Hill yunchao@cs.unc.edu

4/ 12
Feature Vector? High Dimension?
• Feature Vector: Example
• Nudity detection algorithm based on a neural network, by Choi
• Image file (PNG) -> 8 x 8 vector (0, 0, 0, …, 0.3241, 0.00441, …)
• In practice, feature vectors with many more dimensions are used

5/ 12
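Image Search and kNN
• A set of d-dimensional feature vectors representing images: $\mathbb{D} \subset \mathbb{R}^d$
• For $d_1, d_2 \in \mathbb{R}^d$:
• If Dist($d_1, d_2$) is small, treat $d_1$ and $d_2$ as similar images.
• If Dist($d_1, d_2$) is large, treat $d_1$ and $d_2$ as dissimilar images.
• Represent the query image Q as a point $q$ in $\mathbb{R}^d$
• query $q \in \mathbb{R}^d$
• Finding the k images most similar to Q can then be cast as a k-NN problem
• Return k-NN($q$, $\mathbb{D}$)
Can R-Tree based kNN search solve this problem?
No: curse of dimensionality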
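
As a baseline for the k-NN formulation above, here is a minimal sketch of exact k-NN by linear scan. The function names `dist` and `knn_scan` and the toy data are illustrative assumptions, and Euclidean distance is assumed for Dist.

```python
import heapq
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_scan(q, data, k):
    """Exact k-NN by linear scan: indices of the k points closest to q."""
    return heapq.nsmallest(k, range(len(data)), key=lambda i: dist(q, data[i]))


# Toy 2-dimensional "feature vectors" (hypothetical values).
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
print(knn_scan((0.0, 0.1), data, k=2))  # [0, 3]
```

The scan touches every point for every query, which is the cost the hashing schemes in the rest of the deck try to avoid.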

6/ 12
Reality Check
• Curse of dimensionality
• [Qin Lv et al., Image Similarity Search with Compact Data Structures, CIKM '04]
• Poor performance when the number of dimensions is high
• [Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, VLDB '98]

7/ 12
Data Sensitive Hashing
• A solution to the approximate k-NN problem in high-dimensional space
• $\delta$-recall k-NN problem
• Recall: $\text{recall} = \dfrac{|\text{query result set} \cap \text{kNN objects}|}{|\text{kNN objects}|}$

Method | Curse of dimensionality | Recall | Affected by data distribution | Underlying technique
Scan | X (none) | 1 | X | N/A
R-Tree-based solution | O (strong) | 1 | △ | Index: tree
Locality Sensitive Hashing | △ (weaker) | < 1 | O | Hashing + Mathematics
Data Sensitive Hashing | △ (weaker) | < 1 | △ | Hashing + Machine Learning

[Figure: Venn diagram of the kNN objects and the query result set]
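
The recall definition above can be computed directly from two id sets. This is a small sketch; the function name `recall` and the ids are made up for illustration.

```python
def recall(result_ids, true_knn_ids):
    """|query result set ∩ kNN objects| / |kNN objects|."""
    true_set = set(true_knn_ids)
    return len(set(result_ids) & true_set) / len(true_set)


# Hypothetical ids: the approximate search returned 3 of the 4 true neighbours.
r = recall([1, 2, 3, 9], [1, 2, 3, 4])
print(r)         # 0.75
print(r >= 0.7)  # True: this answer would satisfy a 0.7-recall requirement
```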

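Related Work: LSH
$(r_1, r_2, p_1, p_2)$-sensitive hashing family $\mathcal{H}$:
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
Generating functions: randomly extract functions from $\mathcal{H}$
$h_{11}, h_{12}, \dots, h_{1m} \to g_1$
$h_{21}, h_{22}, \dots, h_{2m} \to g_2$
…
$h_{l1}, h_{l2}, \dots, h_{lm} \to g_l$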

9/ 12
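Locality Sensitive Hashing
• We need to solve the kNN problem in a 100-dimensional real space ($\mathbb{R}^{100}$).
• What if!?
• similar points always collided with each other, and
• dissimilar points never collided,
• i.e., what if we had $h_{ideal}: \mathbb{R}^{100} \to \mathbb{Z}^+$?
[Figure: a query point among data points]
However, no such ideal function exists.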

10/ 12
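Formally,
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
• Informally,
a. if two points are similar, then the probability that their hash values are equal is (and must be) high.
b. if two points are dissimilar, then the probability that their hash values are equal is (and must be) low.
• Intuitively,
• What if we could build a hash function with $p_1 = 1$, $p_2 = 0$?
• However, no such ideal function exists
• Challenging
• Deriving $\mathcal{H}$ is itself mathematically hard!
• Even when a family is derived, $p_1$ tends to be low and $p_2$ high
Problem 1:
A family can be derived, but $p_2$ is too high (it should be low!)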

11/ 12
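Random projection (see backup slide)
• Formally
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
Problem 1: $p_2$ is too high
Solution 1: bundle several functions ($m$ of them) and use them together!
[Figure: the space split into two regions labeled 0 and 1]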
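
The backup slide is not part of this transcript, so the following is only a sketch of one common form of random-projection hashing, assumed here for illustration: $h(x)$ is the side of a random hyperplane through the origin on which $x$ falls. The name `make_random_projection_hash` is hypothetical.

```python
import random


def make_random_projection_hash(dim, rng):
    """Return h: R^dim -> {0, 1}, the side of a random hyperplane through
    the origin (assumed form of the hash; not taken from the slides)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random normal vector

    def h(x):
        return 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else 0

    return h


rng = random.Random(0)
h = make_random_projection_hash(3, rng)
x = (1.0, 2.0, 3.0)
# Scaling a point by a positive factor keeps it on the same side of the
# hyperplane; negating it moves it to the other side.
print(h(x) == h(tuple(2 * v for v in x)))
print(h(x) != h(tuple(-v for v in x)))
```

Two points separated by a small angle are rarely split by a random hyperplane, which is what makes a single bit of this kind locality sensitive.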

12/ 12
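m-concatenation
• let $g(x) = (h_1(x), h_2(x), \dots, h_m(x))$
• For two points q, p that are far apart:
• $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \le \left(\Pr_{h \in \mathcal{H}}[h(q) = h(p)]\right)^m \le p_2^m \ll p_2$
[Figure (Fergus et al.): several cuts, each labeling its two sides 0 and 1, partition the space into cells with codes such as 101, 001, 100, 111]
Problem 1: $p_2$ is too high
Solution 1: bundle several functions ($m$ of them) and use them together!
Effect: fewer false positives
For two dissimilar points, $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \le p_2^m \ll p_2$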

13/ 12
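Random projection
• Even if $p_1$ is as high as 0.8,
• with $m = 5$,
• for two similar points, $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \ge p_1^m = 0.8^5 \approx 0.33$
• That is, if a hash table is built from a single $g$
• and 100 points are very similar to the query point q,
• the guarantee is only that about 33 or more of them are found
• This means low recall!
• With $l = 5$, $1 - (1 - p_1^m)^l \ge 0.86$, so
• on average 86 or more of them can be found
Problem 1: $p_2$ is too high
Solution 1: we bundled several functions ($m$ of them) together!
Effect: fewer false positives
For two dissimilar points, $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \le p_2^m \ll p_2$
Side effect: more false negatives
For two similar points, the guarantee weakens to $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \ge p_1^m \ll p_1$
Problem 2: lowering $p_2^m$ also lowered $p_1^m$ (which should be high!)
Solution 2: use several $g$'s ($l$ of them) and look for the k-NN among everything they return!
Effect: high recall
That is, a recall of $1 - (1 - p_1^m)^l$ can be achieved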
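
The numbers on this slide can be checked with a few lines of arithmetic. In the sketch below the function names are illustrative, and $p_2 = 0.4$ is an assumed value used only to show the false-positive side.

```python
def p_single_table(p, m):
    """Bound on the collision probability under one g made of m functions."""
    return p ** m


def p_any_table(p, m, l):
    """Bound on the probability of colliding in at least one of l tables."""
    return 1.0 - (1.0 - p ** m) ** l


p1, m, l = 0.8, 5, 5
p2 = 0.4  # assumed for illustration

print(round(p_single_table(p1, m), 5))  # 0.32768
print(round(p_any_table(p1, m, l), 4))  # 0.8626
print(round(p_single_table(p2, m), 5))  # 0.01024
```

Raising $m$ pushes both bounds down, and raising $l$ pulls both back up, so the two parameters have to be tuned together.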

14/ 12
Structure
• LSH
• a set of hash tables $\{H_i \mid 1 \le i \le l\}$
• hash function $g_i: \mathbb{R}^{100} \to \{0,1\}^m$ for $H_i$
• for example, $m = 6$, $l = 26$
[Figure: hash tables $H_1, H_2, \dots, H_{26}$, each mapping the keys 000000 through 111111 to buckets; table $H_i$ is addressed by $g_i$]

15/ 12
Processing: Diagram
• Query point q = (984.29, 946.23, …, 848.21)
• Processing
• Step 1. Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
• Step 2. return k_Nearest_Neighbors(q) in C
• linear search
[Figure: q is hashed by $g_1, g_2, \dots, g_{26}$ into one bucket of each table $H_1, H_2, \dots, H_{26}$ (keys 000000 through 111111)]
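
The structure and the two processing steps can be sketched end to end as follows. This is a minimal illustration rather than a reference implementation: the class name `LSHIndex`, the use of random hyperplanes as the hash family, and the random toy data are all assumptions.

```python
import heapq
import math
import random


class LSHIndex:
    """l hash tables; table i is keyed by g_i(x) = (h_i1(x), ..., h_im(x))."""

    def __init__(self, data, dim, m, l, seed=0):
        rng = random.Random(seed)
        self.data = data
        # Each g_i is m random hyperplane normals (assumed hash family).
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
            for _ in range(l)
        ]
        self.tables = [{} for _ in range(l)]
        for idx, x in enumerate(data):
            for i in range(l):
                self.tables[i].setdefault(self._g(i, x), []).append(idx)

    def _g(self, i, x):
        """m-bit key of x in table i."""
        return tuple(
            1 if sum(a * v for a, v in zip(plane, x)) >= 0 else 0
            for plane in self.planes[i]
        )

    def candidates(self, q):
        """Step 1: C = union over i of H_i.get(q)."""
        c = set()
        for i, table in enumerate(self.tables):
            c.update(table.get(self._g(i, q), []))
        return c

    def knn(self, q, k):
        """Step 2: linear search for the k nearest points inside C."""
        def d(j):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, self.data[j])))
        return heapq.nsmallest(k, self.candidates(q), key=d)


# Toy data: 200 random 8-dimensional points (hypothetical).
data_rng = random.Random(1)
data = [tuple(data_rng.gauss(0.0, 1.0) for _ in range(8)) for _ in range(200)]
index = LSHIndex(data, dim=8, m=6, l=26)
result = index.knn(data[0], k=3)
print(result[0])  # 0: a stored point always collides with itself
```

Only the candidate set is scanned in Step 2, so the query cost depends on how many points share a bucket with q rather than on the size of the data set.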

16/ 12
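Formally,
Randomly extract functions:
$h_{11}, h_{12}, \dots, h_{1m} \to g_1$
$h_{21}, h_{22}, \dots, h_{2m} \to g_2$
…
$h_{l1}, h_{l2}, \dots, h_{lm} \to g_l$
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
Traditional LSH technique:
① Derive $\mathcal{H}$ mathematically
② Prove that a. and b. hold for an arbitrary $h \in \mathcal{H}$ w.r.t. the parameters $r_1, r_2$
③ Randomly extract functions and build the hash tables.
In DSH (Data Sensitive Hashing):
① Learn $h$ by adaptive boosting and set $\mathcal{H} = \mathcal{H} \cup \{h\}$
② If $\mathcal{H}$ is not sufficient to guarantee that a. and b. hold for all $h \in \mathcal{H}$, go to ①
③ Randomly extract functions and build the hash tables.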

17/ 12
LSH VS DSH
Method | Considers data distribution | Underlying technique
Locality Sensitive Hashing | X (a uniform distribution is assumed from the outset, for ②) | Hashing + Mathematics
Data Sensitive Hashing | O (h is extracted directly against the target data distribution) | Hashing + Machine Learning

18/ 12
LSH VS DSH 2
Method | Sensitive to | Underlying technique
Locality Sensitive Hashing | hashing that is sensitive to $r_1, r_2$ | Hashing + Mathematics
Data Sensitive Hashing | hashing that is sensitive to the data (k-NN and non-ck-NN) | Hashing + Machine Learning

20/ 12
Example: Data Set
• 100-dimensional data set 𝐷
• $|D| = 100$
• 10 clusters

21/ 12
Build DSH for D
• DSH dsh = new DSH(10, 1.1, 0.7, 0.6, 0.4, querySet, dataSet);
Parameter | Value
k (k-NN) | 10
$\alpha$ (learning rate) | 1.1
$\delta$ (lower bound of recall) | 70%
$p_1$ | 0.6
$p_2$ | 0.4
Query set | D
Data set | D

22/ 12
Structure
• DSH
• a set of hash tables $\{H_i \mid 1 \le i \le l\}$
• hash function $g_i: \mathbb{R}^{100} \to \{0,1\}^m$ for $H_i$
• for example, $m = 6$, $l = 26$
[Figure: hash tables $H_1, H_2, \dots, H_{26}$, each mapping the keys 000000 through 111111 to buckets; table $H_i$ is addressed by $g_i$]

23/ 12
Query Example
• res = dsh.k_Nearest_Neighbor(q=new Point(984.29, 946.23, ..., 848.21));
• returns the 10-aNN (approximate nearest neighbor) objects of the given point q
• DSH's property:
• The result set must include at least 70% of the exact 10-NN objects
• Result:
Query point q: (984.29, 946.23, ..., 848.21)
[Figure: the 10-aNN of q (recall: 100%)]

24/ 12
Processing: dsh.k_Nearest_Neighbor(q)
• Query point q = (984.29, 946.23, …, 848.21)
• Processing
• Step 1. Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
• Step 2. return k_Nearest_Neighbors(q) in C
• linear search
[Figure: q is hashed by $g_1, g_2, \dots, g_{26}$ into one bucket of each table $H_1, H_2, \dots, H_{26}$ (keys 000000 through 111111)]

25/ 12
$H_i.get(q)$
• Query point q = (984.29, 946.23, …, 848.21)
• $H_1.get(q) = H_1[g_1(q)] = H_1[111010_2] = H_1[58_{10}]$
• $g_1(q) = (h_{11}(q), h_{12}(q), \dots, h_{16}(q)) = (1, 1, 1, 0, 1, 0)$
[Figure: hash table $H_1$]
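
The step from a 6-bit hash code to the bucket index $111010_2 = 58_{10}$ is plain binary-to-integer conversion. A small sketch (the helper name `bits_to_key` is an assumption):

```python
def bits_to_key(bits):
    """Pack a tuple of 0/1 values, most significant bit first, into an int."""
    key = 0
    for b in bits:
        key = (key << 1) | b
    return key


print(bits_to_key((1, 1, 1, 0, 1, 0)))  # 58
print(format(58, "06b"))                # 111010
```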

26/ 12
• for each $H_i$
• $H_1.get(q)$ = {Data(id=93~98, 100~102)}
• $H_2.get(q)$ = ...
• ...
• $H_{26}.get(q)$ = ...
• Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
• dsh.k_Nearest_Neighbor(q)
• = k_Nearest_Neighbors(q) in C

28/ 12
• Candidate set $C = \{data_{93}, data_{94}, \dots\}$
• $|C| = 28$
• result <- Find k-NN(q) in C
• dsh.k_Nearest_Neighbor(q)
• return result
[Figure: query point $q = (984.29, 946.23, \dots, 848.21)^T$]

30/ 12
Build DSH for D
• Step 1. Generate $\mathcal{H}$, the data sensitive hashing family (Chapter 3-4)
• Step 2. Generate the hash functions $g_1, \dots, g_l$ by randomly extracting functions from $\mathcal{H}$
$h_{11}, h_{12}, \dots, h_{1m} \to g_1$
$h_{21}, h_{22}, \dots, h_{2m} \to g_2$
…
$h_{l1}, h_{l2}, \dots, h_{lm} \to g_l$
• Step 3. For each $g_i$ ($1 \le i \le l$):
• Initialize a hash table $T_i$ for $g_i$ (<key, value> = <integer array, data>)
• For each $o \in D$: $T_i.put(g_i(o), o)$
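
Steps 2 and 3 can be sketched independently of how the family is learned. In the sketch below the family is a toy stand-in (per-coordinate thresholds), and `build_tables`, `make_threshold_hash`, and sampling with replacement are all assumptions made for illustration.

```python
import random


def build_tables(family, data, m, l, seed=0):
    """Step 2: draw m functions per g_i; Step 3: fill one table per g_i."""
    rng = random.Random(seed)
    gs, tables = [], []
    for _ in range(l):
        # Sampling with replacement is an assumption of this sketch.
        hs = [rng.choice(family) for _ in range(m)]

        def g(x, hs=hs):
            return tuple(h(x) for h in hs)

        table = {}
        for o in data:
            table.setdefault(g(o), []).append(o)  # T_i.put(g_i(o), o)
        gs.append(g)
        tables.append(table)
    return gs, tables


def make_threshold_hash(d):
    """Toy stand-in for a learned h: thresholds coordinate d at 0.5."""
    return lambda x: 1 if x[d] >= 0.5 else 0


family = [make_threshold_hash(d) for d in range(3)]
data = [(0.1, 0.9, 0.4), (0.8, 0.2, 0.7), (0.6, 0.6, 0.6)]
gs, tables = build_tables(family, data, m=2, l=4)
print(len(tables), sorted(tables[0].keys()))
```

Because the tables only need callables that return 0 or 1, the same build step works whether the family was derived mathematically or learned from data.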

34/ 12
a Weak Classifier
• A weak classifier $\varphi(\langle q_i, o_j \rangle)$ is a function
• Input: a <query $q_i$, data $o_j$> pair
• Desired output: 0 if $o_j \in kNN(q_i)$; 1 if $o_j \notin ckNN(q_i)$

Input pair <$q_i$, $o_j$> | Output | Verdict
kNN pair | 0 | correct
non-ckNN pair | 0 | incorrect
kNN pair | 1 | incorrect
non-ckNN pair | 1 | correct

Note: a weak classifier may produce many incorrect results
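
Following the Objective slide later in the deck, where $\varphi_h(\langle Q_i, X_j \rangle) = (h(Q_i) - h(X_j))^2$, a weak classifier can be sketched as a thin wrapper around a single 0/1 hash function. The threshold hash used here is a hypothetical stand-in for a learned $h$.

```python
def make_weak_classifier(h):
    """phi(q, o) = (h(q) - h(o))**2: 0 if q and o share a hash value, else 1."""
    def phi(q, o):
        return (h(q) - h(o)) ** 2
    return phi


def h(x):
    """Toy 0/1 hash function: thresholds the first coordinate at 0.5."""
    return 1 if x[0] >= 0.5 else 0


phi = make_weak_classifier(h)
print(phi((0.1,), (0.2,)))  # 0: treated as a kNN pair
print(phi((0.1,), (0.9,)))  # 1: treated as a non-ckNN pair
```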

35/ 12
Adaptive Boosting
• Build a strong classifier by combining several weak classifiers
[Figure: three training rounds. Round 1: the weak classifier trainer fits Weak Classifier 1 on the set of query-data pairs (<$q_i$, $o_j$>); a test splits the pairs into well classified and badly classified, and the badly classified pairs are fed back. Round 2: Weak Classifier 2 is trained with that feedback, and its badly classified pairs are fed back again. Round 3: Weak Classifier 3 is trained, leaving well classified pairs.]
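
The feedback loop in the figure can be illustrated with a simplified reweighting scheme. This sketch is written in the spirit of adaptive boosting and is not the exact update rule of AdaBoost or of the DSH paper: the function `boost`, the multiplicative factor `alpha` (borrowing the value 1.1 from the parameter slide), and the toy pairs are all assumptions.

```python
def boost(pairs, labels, candidates, rounds, alpha=1.1):
    """pairs: list of (q, o); labels: 0 for a kNN pair, 1 for a non-ckNN pair;
    candidates: weak classifiers phi(q, o) -> 0 or 1."""
    weights = [1.0] * len(pairs)
    chosen = []
    for _ in range(rounds):
        def weighted_error(phi):
            return sum(w for w, (q, o), y in zip(weights, pairs, labels)
                       if phi(q, o) != y)

        best = min(candidates, key=weighted_error)
        chosen.append(best)
        # Feed back: badly classified pairs count more in the next round.
        weights = [w * alpha if best(q, o) != y else w
                   for w, (q, o), y in zip(weights, pairs, labels)]
    return chosen, weights


def make_phi(d):
    """Weak classifier from a toy hash that thresholds coordinate d at 0.5."""
    def h(x):
        return 1 if x[d] >= 0.5 else 0
    return lambda q, o: (h(q) - h(o)) ** 2


phi0, phi1 = make_phi(0), make_phi(1)
q = (0.1, 0.1)
pairs = [(q, (0.2, 0.9)), (q, (0.9, 0.2)), (q, (0.3, 0.8))]
labels = [0, 1, 1]
chosen, weights = boost(pairs, labels, [phi0, phi1], rounds=2)
print(weights)  # the third pair, which phi0 misclassifies, has gained weight
```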

36/ 12
a Strong Classifier
[Figure: Weak Classifiers 1, 2, and 3 are combined into a strong classifier that takes a query-data pair as input; the diagram still shows a small set of badly classified pairs.]

37/ 12
Adaptive Boosting
[Figure: the three-round adaptive boosting loop: Weak Classifiers 1 to 3 are trained in turn by the weak classifier trainer on query-data pairs, and after each test the badly classified pairs are fed back into the next round.]

Single Hash Function Optimization

39/ 12
Notation
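• Query set $Q = (Q_1, Q_2, \dots, Q_q)$
• Data set $X = (X_1, X_2, \dots, X_n)$
• Weight matrix $W$
• $W_{ij} = \begin{cases} 1, & \text{if } X_j \text{ is a } k\text{-NN of } Q_i \\ -1, & \text{if } X_j \text{ is a (sampled) non-}ck\text{-NN of } Q_i \\ 0, & \text{else} \end{cases}$
Example with $k = 2$, $c = \frac{3}{2}$, sampling rate = 1 (rows: queries 1 and 2; columns: data points 1 to 4):
$W = \begin{pmatrix} 1 & 1 & 0 & -1 \\ -1 & 0 & 1 & 1 \end{pmatrix}$
[Figure: layout of queries 1 and 2 and data points 1 to 4]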
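
The definition of $W$ can be turned into a short routine. In this sketch the function name `weight_matrix` and the one-dimensional toy positions are assumptions; the positions were chosen so that the output reproduces the example matrix on this slide, and sampling rate 1 means every non-ck-NN gets the weight $-1$.

```python
def weight_matrix(queries, data, k, c, dist):
    """W[i][j] = 1 if X_j is a k-NN of Q_i, -1 if X_j is a non-ck-NN of Q_i
    (sampling rate 1, so all of them are kept), 0 otherwise."""
    ck = int(c * k)
    W = []
    for q in queries:
        order = sorted(range(len(data)), key=lambda j: dist(q, data[j]))
        row = [0] * len(data)
        for rank, j in enumerate(order):
            if rank < k:
                row[j] = 1
            elif rank >= ck:
                row[j] = -1
        W.append(row)
    return W


# Hypothetical 1-D positions for two queries and four data points.
queries = [0.0, 5.0]
data = [1.0, 2.0, 3.0, 4.0]
W = weight_matrix(queries, data, k=2, c=1.5, dist=lambda a, b: abs(a - b))
print(W)  # [[1, 1, 0, -1], [-1, 0, 1, 1]]
```

The zero entries are the points ranked between $k$ and $ck$: they are neither rewarded nor penalized, which gives the learned hash function some slack around the k-NN boundary.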

40/ 12
Objective
• $\arg\min_h \sum_{ij} \varphi_h(\langle Q_i, X_j \rangle) \cdot W_{ij}$
• $= \arg\min_h \sum_{ij} \left(h(Q_i) - h(X_j)\right)^2 \cdot W_{ij}$
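
To make the objective concrete, the sketch below evaluates it for a handful of candidate hash functions and picks the minimizer. The weight matrix is the example from the Notation slide; the one-dimensional positions and the threshold form of $h$ are assumptions made only for this illustration.

```python
W = [[1, 1, 0, -1],
     [-1, 0, 1, 1]]            # example matrix from the Notation slide
queries = [0.0, 5.0]           # hypothetical 1-D positions
data = [1.0, 2.0, 3.0, 4.0]    # hypothetical 1-D positions


def make_threshold_hash(t):
    """Toy 0/1 hash function: h(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def objective(h, queries, data, W):
    """Sum over i, j of (h(Q_i) - h(X_j))**2 * W[i][j]."""
    return sum(((h(q) - h(x)) ** 2) * W[i][j]
               for i, q in enumerate(queries)
               for j, x in enumerate(data))


thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]
scores = {t: objective(make_threshold_hash(t), queries, data, W)
          for t in thresholds}
best = min(scores, key=scores.get)
print(scores)  # {0.5: 1, 1.5: -1, 2.5: -2, 3.5: -1, 4.5: 1}
print(best)    # 2.5
```

A kNN pair that is split costs $+1$ and a non-ck-NN pair that is split earns $-1$, so the minimizer is the cut that separates far pairs while keeping near pairs together.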
