DSH: Data Sensitive Hashing for High-Dimensional k-NN Search
Choi¹, Myung², Lim¹, Song²
¹DataKnow. Lab., ²D&C Lab., Korea Univ.
Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi, SIGMOD '14
3/ 12
App: Large Scale Image Search in Database
• Find similar images in a large database (e.g. Google image search)
Kristen Grauman et al.
slide: Yunchao Gong, UNC Chapel Hill
4/ 12
Feature Vector? High Dimension?
• Feature Vector: Example
  • Nudity detection algorithm based on a neural network, by Choi
  • Image file (png) -> 8 x 8 vector (0, 0, 0, …, 0.3241, 0.00441, …)
  • In practice, feature vectors with many more dimensions are used
5/ 12
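Image Search and kNN
• Let $\mathbb{D} \subset \mathbb{R}^d$ be a set of d-dimensional feature vectors representing images
• For $d_1, d_2 \in \mathbb{R}^d$:
  • if Dist($d_1, d_2$) is small, regard $d_1$ and $d_2$ as similar images.
  • if Dist($d_1, d_2$) is large, regard $d_1$ and $d_2$ as dissimilar images.
• Represent the query image Q as a point $q$ in $\mathbb{R}^d$
  • query $q \in \mathbb{R}^d$
• Finding the k images most similar to Q then reduces to a k-NN problem
  • Return k-NN($q$, $\mathbb{D}$)
Can R-Tree based kNN search solve this problem?
Not feasible: curse of dimensionality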
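As a baseline for the k-NN formulation above, a minimal Java sketch of the exact full scan is given here. The class name `LinearScanKnn`, the use of Euclidean distance for Dist, and the toy points are assumptions for illustration; the slides do not fix a distance function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LinearScanKnn {
    /** Euclidean distance between two vectors of equal length (assumed choice of Dist). */
    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Exact k-NN(q, data): sort every point by distance to q and keep the first k. */
    static List<double[]> kNearestNeighbors(double[] q, List<double[]> data, int k) {
        List<double[]> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble((double[] o) -> dist(q, o)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional toy data.
        List<double[]> data = List.of(
                new double[] {0.0, 0.0}, new double[] {1.0, 1.0},
                new double[] {5.0, 5.0}, new double[] {0.5, 0.0});
        for (double[] o : kNearestNeighbors(new double[] {0.1, 0.0}, data, 2)) {
            System.out.println(Arrays.toString(o));
        }
    }
}
```

The scan touches every point for every query, which is why the deck looks for an index that inspects only a small candidate set.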
6/ 12
Reality Check
• Curse of dimensionality
  • [Qin Lv et al., Image Similarity Search with Compact Data Structures @ CIKM '04]
  • poor performance when the number of dimensions is high
Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces @ VLDB '98
7/ 12
Data Sensitive Hashing
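• a solution to the approximate k-NN problem in high-dimensional space
• $\delta$-recall k-NN problem
• Recall: $\dfrac{|\text{query result set} \cap \text{kNN objects}|}{|\text{kNN objects}|}$
Method | Curse of dimensionality | Recall | Affected by data distribution | Underlying technique
Scan | X (none) | 1 | X | N/A
RTree-based Solution | O (strong) | 1 | △ | index: Tree
Locality Sensitive Hashing | △ (weaker) | < 1 | O | Hashing + Mathematics
Data Sensitive Hashing | △ (weaker) | < 1 | △ | Hashing + Machine Learning
[Figure: overlap between the kNN objects and the query result set]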
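The recall measure defined on this slide can be computed directly from two id sets. The following is a small sketch; the class name `RecallMetric` and the integer object ids are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

final class RecallMetric {
    /** recall = |result set ∩ kNN objects| / |kNN objects|, with objects identified by integer ids. */
    static double recall(Set<Integer> resultIds, Set<Integer> exactKnnIds) {
        Set<Integer> hits = new HashSet<>(resultIds);
        hits.retainAll(exactKnnIds); // keep only the ids that are true nearest neighbors
        return (double) hits.size() / exactKnnIds.size();
    }

    public static void main(String[] args) {
        Set<Integer> exact = Set.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        Set<Integer> approx = Set.of(1, 2, 3, 4, 5, 6, 7, 20, 21, 22);
        // 7 of the 10 exact neighbors appear in the approximate result.
        System.out.println(recall(approx, exact));
    }
}
```

A $\delta$-recall k-NN method with $\delta = 0.7$ would accept the result in `main`, since 7 of the 10 true neighbors were returned.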
Related Work: LSH
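$(r_1, r_2, p_1, p_2)$-sensitive hashing family $\mathcal{H}$:
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
Randomly extract functions from $\mathcal{H}$ to generate the functions $g_1, \dots, g_l$:
• $h_{11}, h_{12}, \dots, h_{1m} \to g_1$
• $h_{21}, h_{22}, \dots, h_{2m} \to g_2$
• …
• $h_{l1}, h_{l2}, \dots, h_{lm} \to g_l$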
9/ 12
Locality Sensitive Hashing
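• Suppose the kNN problem must be solved in 100-dimensional real space ($\mathbb{R}^{100}$).
• What if there were a function $h_{ideal}: \mathbb{R}^{100} \to \mathbb{Z}^+$ such that
  • similar points collide with each other, and
  • dissimilar points do not collide?
[Figure: query point]
However, no such ideal function exists.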
10/ 12
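Formally, a $(r_1, r_2, p_1, p_2)$-sensitive hashing family $\mathcal{H}$ satisfies:
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
• Informally,
  a. if two points are similar, the probability that their hash values are equal is (and should be) high.
  b. if two points are dissimilar, the probability that their hash values are equal is (and should be) low.
• Intuitively,
  • what if we could construct a hash function with $p_1 = 1$, $p_2 = 0$?
  • however, no such ideal function exists
• Challenging
  • deriving $\mathcal{H}$ is itself mathematically difficult!
  • even when it can be derived, $p_1$ is usually low and $p_2$ is high
Problem 1: $\mathcal{H}$ can be derived, but $p_2$ is too high (it should be low!)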
11/ 12
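Random projection (see backup slide)
• Formally
  a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
  b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
[Figure: a random projection assigning each point the bit 0 or 1]
slide: Yunchao Gong, UNC Chapel Hill
Solution 1 (to Problem 1, $p_2$ too high): bundle several ($m$) functions and use them together!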
12/ 12
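m-concatenation
• let $g(x) = (h_1(x), h_2(x), \dots, h_m(x))$
• for two distant points q, p:
  • $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \le \left(\Pr_{h \in \mathcal{H}}[h(q) = h(p)]\right)^m \le p_2^m \ll p_2$
[Figure: projections labeled 0/1 and the resulting 3-bit codes 101, 001, 100, 111; Fergus et al.]
slide: Yunchao Gong, UNC Chapel Hill
Effect of Solution 1: fewer false positives
  for two dissimilar points, $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \le p_2^m \ll p_2$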
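The m-concatenation $g(x) = (h_1(x), \dots, h_m(x))$ can be sketched with sign random projections. This is only an illustration: the class name `ConcatenatedHash`, the Gaussian directions through the origin, and the seed are assumptions, not details taken from the slides.

```java
import java.util.Random;

final class ConcatenatedHash {
    private final double[][] directions; // one random direction per h_i (assumed Gaussian)

    /** Draws m random hyperplanes through the origin in dim dimensions. */
    ConcatenatedHash(int m, int dim, long seed) {
        Random rng = new Random(seed);
        directions = new double[m][dim];
        for (double[] d : directions) {
            for (int j = 0; j < dim; j++) {
                d[j] = rng.nextGaussian();
            }
        }
    }

    /** h_i(x): 1 if x lies on the non-negative side of hyperplane i, else 0. */
    int h(int i, double[] x) {
        double dot = 0.0;
        for (int j = 0; j < x.length; j++) {
            dot += directions[i][j] * x[j];
        }
        return dot >= 0 ? 1 : 0;
    }

    /** g(x) = (h_1(x), ..., h_m(x)), rendered as an m-character bit string such as "101". */
    String g(double[] x) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < directions.length; i++) {
            key.append(h(i, x));
        }
        return key.toString();
    }

    public static void main(String[] args) {
        ConcatenatedHash g = new ConcatenatedHash(3, 2, 42L);
        System.out.println(g.g(new double[] {1.0, 2.0}));
        System.out.println(g.g(new double[] {1.01, 2.0}));  // nearby point: likely the same key
        System.out.println(g.g(new double[] {-1.0, -2.0})); // mirrored point: bits flip unless a dot product is exactly 0
    }
}
```

Two points share a bucket only if all m bits agree, which is what pushes the collision probability of distant points down to at most $p_2^m$.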
13/ 12
Random projection
• Even if $p_1$ is as high as 0.8,
• with $m = 5$,
  • for two similar points, $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \ge p_1^m = 0.8^5 \approx 0.33$
  • that is, if a hash table is built from a single $g$
    • and 100 points are very similar to the query point q,
    • only about 33 of them are guaranteed to be found
  • low recall!
• With $l = 5$, $1 - (1 - p_1^m)^l \ge 0.86$,
  • so on average 86 or more of them can be found
slide: Yunchao Gong, UNC Chapel Hill
Side effect of Solution 1: false negatives also increase
  for two similar points, the guarantee drops to $\Pr_{g \in \mathcal{G}}[g(q) = g(p)] \ge p_1^m$, and $p_1^m \ll p_1$
Problem 2: lowering $p_2$ to $p_2^m$ also lowered $p_1$ to $p_1^m$ (it should be high!)
Solution 2: use several ($l$) functions $g$ and search for the k-NN among their results!
Effect: high recall, i.e. a recall of $1 - (1 - p_1^m)^l$ can be achieved
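The numbers on this slide can be checked by hand. With $p_1 = 0.8$, $m = 5$ and $l = 5$ (the values used above), and assuming the $l$ functions $g$ are drawn independently:

```latex
\begin{align*}
p_1^m &= 0.8^5 = 0.32768 \approx 0.33 \\
1 - \left(1 - p_1^m\right)^l &= 1 - (1 - 0.32768)^5 = 1 - 0.67232^5 \approx 1 - 0.1374 = 0.8626 \ge 0.86
\end{align*}
```

A similar point is missed only if it fails to collide with q in all $l$ tables, which is where the $(1 - p_1^m)^l$ term comes from.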
14/ 12
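Structure
• LSH
  • a set of hash tables $\{H_i \mid 1 \le i \le l\}$
  • a hash function $g_i: \mathbb{R}^{100} \to \{0, 1\}^m$ for each $H_i$
  • for example, $m = 6$, $l = 26$
[Figure: hash tables $H_1, H_2, \dots, H_{26}$, one per function $g_1, g_2, \dots, g_{26}$; each maps the keys 000000, 000001, ..., 111111 to buckets]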
15/ 12
Processing: Diagram
• Query point $q = (984.29, 946.23, \dots, 848.21)$
• Processing
  • Step 1. Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
  • Step 2. return k_Nearest_Neighbors(q) in C
    • linear search
[Figure: q is hashed by $g_1, g_2, \dots, g_{26}$ into one bucket of each table $H_1, H_2, \dots, H_{26}$]
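The two processing steps can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the class name `HashIndexQuery`, the use of `String` bucket keys, Euclidean distance, and the toy one-bit hash functions in `main` are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class HashIndexQuery {
    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /**
     * Step 1: C = union over i of the bucket of table H_i selected by g_i(q).
     * Step 2: linear search for the k nearest points inside C.
     * tables.get(i) plays the role of H_i and hashes.get(i) the role of g_i.
     */
    static List<double[]> query(double[] q, int k,
                                List<Map<String, List<double[]>>> tables,
                                List<Function<double[], String>> hashes) {
        // Arrays compare by identity, so a point found in several tables is kept once.
        Set<double[]> candidates = new LinkedHashSet<>();
        for (int i = 0; i < tables.size(); i++) {
            List<double[]> bucket = tables.get(i).get(hashes.get(i).apply(q));
            if (bucket != null) {
                candidates.addAll(bucket);
            }
        }
        List<double[]> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((double[] o) -> dist(q, o)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 1.0};
        double[] b = {2.0, -1.0};
        double[] c = {-3.0, 4.0};
        // Hypothetical one-bit hash functions: the sign of one coordinate each.
        Function<double[], String> g1 = x -> x[0] >= 0 ? "1" : "0";
        Function<double[], String> g2 = x -> x[1] >= 0 ? "1" : "0";
        Map<String, List<double[]>> h1 = new HashMap<>();
        h1.put("1", List.of(a, b));
        h1.put("0", List.of(c));
        Map<String, List<double[]>> h2 = new HashMap<>();
        h2.put("1", List.of(a, c));
        h2.put("0", List.of(b));
        List<double[]> res = query(new double[] {0.5, 0.5}, 2, List.of(h1, h2), List.of(g1, g2));
        System.out.println(res.size());
    }
}
```

Only the points in the selected buckets are compared against q, so the cost of Step 2 depends on $|C|$ rather than on the size of the whole data set.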
16/ 12
Formally, a $(r_1, r_2, p_1, p_2)$-sensitive hashing family $\mathcal{H}$ satisfies:
a. if $\mathrm{dist}(o, q) \le r_1$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \ge p_1$
b. if $\mathrm{dist}(o, q) \ge r_2$ then $\Pr_{\mathcal{H}}[h(q) = h(o)] \le p_2$
Functions are randomly extracted from $\mathcal{H}$ to generate $g_1, \dots, g_l$, with $h_{i1}, h_{i2}, \dots, h_{im} \to g_i$.
Traditional LSH technique:
① Derive $\mathcal{H}$ mathematically.
② Prove that a. and b. hold for an arbitrary $h \in \mathcal{H}$ w.r.t. the parameters $r_1, r_2$.
③ Randomly extract functions and build the hash tables.
In DSH (Data Sensitive Hashing):
① Learn $h$ using adaptive boosting and set $\mathcal{H} = \mathcal{H} \cup \{h\}$.
② If $\mathcal{H}$ is not sufficient to guarantee that a. and b. hold for all $h \in \mathcal{H}$, go to ①.
③ Randomly extract functions and build the hash tables.
17/ 12
LSH vs. DSH
Method | Considers data distribution | Underlying technique
Locality Sensitive Hashing | X (a uniform distribution is assumed from the outset, for step ② of the LSH procedure) | Hashing + Mathematics
Data Sensitive Hashing | O (h is derived directly from the distribution of the target data) | Hashing + Machine Learning
18/ 12
LSH vs. DSH 2
Method | Sensitive to | Underlying technique
Locality Sensitive Hashing | hashing sensitive to $r_1, r_2$ | Hashing + Mathematics
Data Sensitive Hashing | hashing sensitive to the data (k-NN and non-ck-NN) | Hashing + Machine Learning
DSH: demonstration
20/ 12
Example: Data Set
• 100-dimensional data set 𝐷
• 𝐷 = 100
• 10 clusters
21/ 12
Build DSH for D
• DSH dsh = new DSH(10, 1.1, 0.7, 0.6, 0.4, querySet, dataSet);
Parameter | Value
k (k-NN) | 10
$\alpha$ (learning rate) | 1.1
$\delta$ (lower bound of recall) | 70%
$p_1$ | 0.6
$p_2$ | 0.4
Query Set | D
Data Set | D
22/ 12
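Structure
• DSH
  • a set of hash tables $\{H_i \mid 1 \le i \le l\}$
  • a hash function $g_i: \mathbb{R}^{100} \to \{0, 1\}^m$ for each $H_i$
  • for example, $m = 6$, $l = 26$
[Figure: hash tables $H_1, H_2, \dots, H_{26}$, one per function $g_1, g_2, \dots, g_{26}$; each maps the keys 000000, 000001, ..., 111111 to buckets]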
23/ 12
Query Example
• res = dsh.k_Nearest_Neighbor(q = new Point(984.29, 946.23, ..., 848.21));
  • returns the 10-aNN objects of the given point q
• DSH's property:
  • the result set must include at least 70% of the exact 10-NN objects
• Result:
  Query point q: (984.29, 946.23, ..., 848.21)
  10-aNN of q (recall: 100%)
24/ 12
Processing: dsh.k_Nearest_Neighbor(q)
• Query point $q = (984.29, 946.23, \dots, 848.21)$
• Processing
  • Step 1. Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
  • Step 2. return k_Nearest_Neighbors(q) in C
    • linear search
[Figure: q is hashed by $g_1, g_2, \dots, g_{26}$ into one bucket of each table $H_1, H_2, \dots, H_{26}$]
25/ 12
$H_i.get(q)$
• Query point $q = (984.29, 946.23, \dots, 848.21)$
• $H_1.get(q) = H_1[g_1(q)] = H_1[111010_2] = H_1[58_{10}]$
  • $g_1(q) = (h_{11}(q), h_{12}(q), \dots, h_{16}(q)) = (1, 1, 1, 0, 1, 0)$
[Figure: table $H_1$]
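The bucket lookup on this slide reads the $m = 6$ hash bits as one binary number. A minimal sketch of that conversion follows; the class and method names are hypothetical.

```java
final class BucketKey {
    /** Reads the bits (h_1(q), ..., h_m(q)) as a binary number, most significant bit first. */
    static int toKey(int[] bits) {
        int key = 0;
        for (int bit : bits) {
            key = (key << 1) | bit; // shift the bits seen so far and append the next one
        }
        return key;
    }

    public static void main(String[] args) {
        // 111010 in binary is 32 + 16 + 8 + 2 = 58.
        System.out.println(toKey(new int[] {1, 1, 1, 0, 1, 0}));
    }
}
```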
26/ 12
Processing: dsh.k_Nearest_Neighbor(q)
• for each $H_i$
  • $H_1.get(q)$ = {Data(id = 93~98, 100~102)}
  • $H_2.get(q)$ = ...
  • ...
  • $H_{26}.get(q)$ = ...
• Candidate set $C = \bigcup_{i=1}^{26} H_i.get(q)$
• dsh.k_Nearest_Neighbor(q)
  • = k_Nearest_Neighbors(q) in C
28/ 12
Processing: dsh.k_Nearest_Neighbor(q)
• Candidate set $C = \{data_{93}, data_{94}, \dots\}$
  • $|C| = 28$
• result <- find k-NN(q) in C
• dsh.k_Nearest_Neighbor(q)
  • return result
[Figure: query point $q = (984.29, 946.23, \dots, 848.21)^T$]
How to build DSH for D?
30/ 12
Build DSH for D
• Step 1. Generate $\mathcal{H}$, the data sensitive hashing family (Chapters 3-4)
• Step 2. Generate the hash functions $g_1, \dots, g_l$ by randomly extracting hash functions from $\mathcal{H}$
  • $h_{11}, h_{12}, \dots, h_{1m} \to g_1$
  • $h_{21}, h_{22}, \dots, h_{2m} \to g_2$
  • …
  • $h_{l1}, h_{l2}, \dots, h_{lm} \to g_l$
• Step 3. For each $g_i$ ($1 \le i \le l$),
  • initialize a hash table $T_i$ for $g_i$ (<key, value> = <Integer array, Data>)
  • for each $o \in D$, $T_i.put(g_i(o), o)$
[Figure labels: Chapter 3, Chapter 4]
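Step 3 amounts to filling one map per function. A small sketch is given below under stated assumptions: the class name `HashTableBuilder` is hypothetical, and `String` keys stand in for the slide's integer-array keys because Java arrays compare by identity and so do not work directly as `HashMap` keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class HashTableBuilder {
    /** Builds one table per hash function; every object o is stored under the key g(o). */
    static List<Map<String, List<double[]>>> build(List<double[]> data,
                                                   List<Function<double[], String>> hashes) {
        List<Map<String, List<double[]>>> tables = new ArrayList<>();
        for (Function<double[], String> g : hashes) {
            Map<String, List<double[]>> table = new HashMap<>();
            for (double[] o : data) {
                // Create the bucket on first use, then append the object to it.
                table.computeIfAbsent(g.apply(o), key -> new ArrayList<>()).add(o);
            }
            tables.add(table);
        }
        return tables;
    }

    public static void main(String[] args) {
        // Hypothetical toy data and a one-bit stand-in for a learned function g.
        List<double[]> data = List.of(
                new double[] {1.0, 1.0}, new double[] {2.0, -1.0}, new double[] {-3.0, 4.0});
        Function<double[], String> g = x -> x[0] >= 0 ? "1" : "0";
        List<Map<String, List<double[]>>> tables = build(data, List.of(g));
        System.out.println(tables.get(0).keySet());
    }
}
```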
Adaptive Boosting: Principle
34/ 12
a Weak Classifier
• A weak classifier $\varphi(\langle q_i, o_j \rangle)$ is a function
  • Input: a <query $q_i$, data $o_j$> pair
  • Desired output: 0 if $o_j \in kNN(q_i)$; 1 if $o_j \notin ckNN(q_i)$
Input pair $\langle q_i, o_j \rangle$ | Output | Verdict
kNN pair | 0 | correct
non-ckNN pair | 0 | incorrect
kNN pair | 1 | incorrect
non-ckNN pair | 1 | correct
note: a weak classifier may produce many incorrect results
35/ 12
Adaptive Boosting
• Build a strong classifier by combining several weak classifiers
[Figure: the 1st query-data pair ($\langle q_i, o_j \rangle$) set trains and tests Weak Classifier 1, which splits it into well classified and badly classified pairs; the badly classified pairs are fed back into the 2nd pair set for Weak Classifier 2, and again into the 3rd pair set for Weak Classifier 3]
36/ 12
a Strong Classifier
• Build a strong classifier by combining several weak classifiers
[Figure: Weak Classifiers 1, 2 and 3 combined into a strong classifier, which splits the query-data pair ($\langle q_i, o_j \rangle$) set into well classified and badly classified pairs]
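The slides do not spell out how the badly classified pairs are fed back. One boosting-style reading, assumed here purely for illustration, is that the weight of every misclassified pair is scaled by a factor `alpha` so that the next weak classifier concentrates on it. The class name, the sign convention for the weights, and the update rule itself are assumptions, not the authors' stated method.

```java
final class BoostingReweight {
    /**
     * Assumed convention: w[i][j] > 0 marks a kNN pair (desired output 0),
     * w[i][j] < 0 marks a non-ckNN pair (desired output 1), 0 means the pair is ignored.
     * phi[i][j] is the weak classifier's output for the pair.
     * Misclassified pairs have their weight multiplied by alpha (assumed update rule).
     */
    static void reweight(double[][] w, int[][] phi, double alpha) {
        for (int i = 0; i < w.length; i++) {
            for (int j = 0; j < w[i].length; j++) {
                boolean wrong = (w[i][j] > 0 && phi[i][j] == 1)
                        || (w[i][j] < 0 && phi[i][j] == 0);
                if (wrong) {
                    w[i][j] *= alpha;
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] w = {{1, 1, 0, -1}, {-1, 0, 1, 1}};
        int[][] phi = {{0, 1, 0, 0}, {1, 1, 0, 0}}; // hypothetical weak-classifier outputs
        reweight(w, phi, 1.1);
        System.out.println(java.util.Arrays.deepToString(w));
    }
}
```

With `alpha` greater than 1, pairs that keep being misclassified grow in weight round after round, which matches the "feed back" arrows in the diagram at a high level.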
Single Hash Function Optimization
39/ 12
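Notation
• Query set $Q = (Q_1, Q_2, \dots, Q_q)$
• Data set $X = (X_1, X_2, \dots, X_n)$
• Weight matrix $W$
  • $W_{ij} = \begin{cases} 1, & \text{if } X_j \text{ is a k-NN of } Q_i \\ -1, & \text{if } X_j \text{ is a (sampled) non-ck-NN of } Q_i \\ 0, & \text{else} \end{cases}$
Example with $k = 2$, $c = \frac{3}{2}$, sampling rate = 1 (rows $Q_1, Q_2$; columns $X_1, \dots, X_4$):
$W = \begin{pmatrix} 1 & 1 & 0 & -1 \\ -1 & 0 & 1 & 1 \end{pmatrix}$
[Figure: the points 1 to 4 and the two queries]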
40/ 12
Objective
• $\arg\min_h \sum_{ij} \varphi_h(\langle Q_i, X_j \rangle) \cdot W_{ij}$
• $= \arg\min_h \sum_{ij} \left(h(Q_i) - h(X_j)\right)^2 \cdot W_{ij}$
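To see how the objective ranks candidate hash functions, the sketch below evaluates $\sum_{ij} (h(Q_i) - h(X_j))^2 \cdot W_{ij}$ for two one-bit functions. The class name, the 1-dimensional points, and the two threshold functions are hypothetical; the points were chosen so that the weights equal the 2 x 4 example matrix of the notation slide ($k = 2$, $c = 3/2$).

```java
import java.util.function.ToIntFunction;

final class HashObjective {
    /** Sum over i, j of (h(Q_i) - h(X_j))^2 * W_ij for a 0/1-valued hash function h. */
    static double objective(double[][] queries, double[][] data, double[][] w,
                            ToIntFunction<double[]> h) {
        double total = 0.0;
        for (int i = 0; i < queries.length; i++) {
            for (int j = 0; j < data.length; j++) {
                int diff = h.applyAsInt(queries[i]) - h.applyAsInt(data[j]);
                total += diff * diff * w[i][j];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical 1-dimensional points laid out to reproduce the example weights.
        double[][] queries = {{0.5}, {9.5}};
        double[][] data = {{0.0}, {1.0}, {9.0}, {10.0}};
        double[][] w = {{1, 1, 0, -1}, {-1, 0, 1, 1}};
        ToIntFunction<double[]> splitBetweenClusters = x -> x[0] >= 5.0 ? 1 : 0;
        ToIntFunction<double[]> splitInsideCluster = x -> x[0] >= 0.75 ? 1 : 0;
        System.out.println(objective(queries, data, w, splitBetweenClusters));
        System.out.println(objective(queries, data, w, splitInsideCluster));
    }
}
```

Separating a k-NN pair adds $+1$ to the sum and separating a non-ck-NN pair adds $-1$, so the function that splits between the two groups of points scores lower (better) than the one that cuts through a group.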

Source: DSH: Data Sensitive Hashing for High-Dimensional k-NN Search. Choi, Myung, Lim, Song (DataKnow. Lab. and D&C Lab., Korea Univ.); Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi, SIGMOD '14.