Yasuo Tabei(JST Minato ERATO Project)
Joint work with Takeaki Uno(NII),
Masashi Sugiyama (TITECH), Koji Tsuda (AIST)
Seminar@Sugiyamalab, October 26, 2010 (Tue.)
 Motivation
- Growth of large-scale data
- Need for all pairs similarity search methods
- Single sorting method and its drawbacks
 Method
- Multiple sorting method
- Locality sensitive hashing
- SketchSort
 Experiments
- Comparison with other state-of-the-art methods
- Large-scale image datasets
 Images
- Flickr, image search, etc.
- 80 million tiny images
(Torralba et al., 2008)
 Genome Sequences
- NCBI Sequence Read Archive
- Large-scale genome sequences
from various organisms
 Chemical Compounds
- NCBI PubChem
- 28 million chemical
compounds
[Figure] Representing data as vectors:
- Image → SIFT features → x=(0.3, -0.3, 0.5, 1.2, …)
- Chemical compound → fingerprint → x=(1, 0, 1, 0, 0, 0, 1, …)
- Also text, protein, DNA/RNA, etc.
 Mapping vectors to binary strings (sketches)
- Preserves the distance in the original space
x=(0.3, 0.1, 0.5, 0.6, 0.7, 1.2, -0.2,…)
↓ Mapping
s=1010010001110001010…
 Advantages
- Giga-scale data fits in main memory
- Accelerates various algorithms
 Finding all neighbor pairs in vector data
- Given a set of data points x1, …, xn
- Find all pairs (xi, xj), i < j,
such that Δ(xi, xj) ≦ ε
 Can build a neighborhood graph
- Vertex: a data point
- Edge: a neighbor pair (distance ≦ ε)
Applications: semi-supervised learning, spectral
clustering, ROI detection in images, retrieval of
protein sequences, etc.
 Finding neighbor pairs by sorting
 Map vector data to sketches (a code sketch follows the example below)
(a)Input
1:101111
2:110101
3:110010
4:010000
5:101000
6:111100
7:000000
8:010110
9:110110
10:100100
(b) Sort
7:000000
4:010000
8:010110
10:100100
5:101000
1:101111
3:110010
2:110101
9:110110
6:111100
(c) Scan neighbors
7:000000
4:010000
8:010110
10:100100
5:101000
1:101111
3:110010
2:110101
9:110110
6:111100
(w: width of the scan window)
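Below is a minimal sketch of the single sorting scan described above, assuming sketches are given as 0/1 strings; the window width plays the role of w, and all names are illustrative rather than taken from the authors' implementation.

# Minimal sketch of the single sorting method (illustrative, not the authors' code).
def single_sort_neighbors(sketches, window, max_hamming):
    """Sort the sketches once, then compare each sketch only with the next
    `window` sketches in sorted order (the role of w on the slide)."""
    order = sorted(range(len(sketches)), key=lambda i: sketches[i])
    pairs = []
    for a in range(len(order)):
        for b in range(a + 1, min(a + 1 + window, len(order))):
            i, j = order[a], order[b]
            dist = sum(c1 != c2 for c1, c2 in zip(sketches[i], sketches[j]))
            if dist <= max_hamming:
                pairs.append((min(i, j) + 1, max(i, j) + 1))  # 1-based ids as on the slide
    return pairs

# The ten 6-bit sketches from the slide, in input order (ids 1..10).
sketches = ["101111", "110101", "110010", "010000", "101000",
            "111100", "000000", "010110", "110110", "100100"]
print(single_sort_neighbors(sketches, window=3, max_hamming=1))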
 Needs a large number of distance calculations to
achieve reasonable accuracy
 Cannot derive an analytical estimate of the
fraction of missing neighbors
 Motivation
- Large-scale data
- Need for all pairs similarity search methods
- Single sorting method and its drawbacks
 Method
- Multiple sorting method
- Locality sensitive hashing
- SketchSort
 Experiments
- Comparison with other state-of-the-art methods
- Large-scale image datasets
 Input: a set of fixed-length strings S={s1,…,sn}
 Output: all pairs of strings within Hamming
distance d
 By applying radix sort, enumerates all pairs in O(n+m) time
- n: number of strings, m: number of output pairs
 Introduces a block-wise masking technique for acceleration
[Figure: MSM pipeline]
Input fixed-length strings
s1=01001001001010100
s2=11010010010101010
s3=00010100100100100
…
→ Enumerate all pairs that share a matching block combination by MSM
(e.g. s6:1111 0011 1001 0111 and s8:1001 0011 1001 0111)
→ Block-level duplication? (Yes → trash)
→ Check Hamming distance (>d → trash, ≦d → output)
 Sort strings by radix sort and divide them into equivalence
classes: O(n) time
 Draw edges between all strings in each equivalence class: O(m) time
 Computational complexity: O(n+m) (a code sketch follows the example)
EMILY
DAVID
CHRIS
ALICE
DAVID
BOBBY
DAVID
ALICE
Sort
ALICE
ALICE
BOBBY
CHRIS
DAVID
DAVID
DAVID
EMILY
Equivalence
Classes
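A small sketch of this exact-match step, with Python's built-in sort standing in for radix sort; the function name and the final print are illustrative.

from itertools import combinations

def exact_match_pairs(strings):
    """Sort the strings (built-in sort stands in for radix sort), split the sorted
    order into runs of identical strings (equivalence classes), and emit all
    pairs inside each class."""
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    pairs, start = [], 0
    for pos in range(1, len(order) + 1):
        if pos == len(order) or strings[order[pos]] != strings[order[start]]:
            for i, j in combinations(sorted(order[start:pos]), 2):  # one equivalence class
                pairs.append((i, j))
            start = pos
    return pairs

names = ["EMILY", "DAVID", "CHRIS", "ALICE", "DAVID", "BOBBY", "DAVID", "ALICE"]
print(exact_match_pairs(names))  # the two ALICEs and the three DAVIDs form classes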
 Mask d characters in all possible ways
 Perform radix sort C(l, d) times (once per choice of masked positions)
 Linear time in the number of strings
 Time exponential in d and polynomial in the string
length l
 Ex) d=2
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
10:1001 0011 1001 0111
5:1010 0010 1110 1010
1:1011 1111 0011 1110
2:1101 0111 0111 0001
3:1100 1000 1101 1100
9:1101 1000 1101 1110
6:1111 0011 1001 0111
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
5:1010 0010 1110 1010
3:1100 1000 1101 1100
6:1111 0011 1001 0111
10:1001 0011 1001 0111
2:1101 0111 0111 0001
9:1101 1000 1101 1110
1:1011 1111 0011 1110
 Mask d blocks in all possible ways
 Reduces the number of sorting operations
 Non-neighbor pairs might be detected
 Filter them out by calculating the actual Hamming distance
(a code sketch follows the example)
 Ex) d=2
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
10:1001 0011 1001 0111
5:1010 0010 1110 1010
1:1011 1111 0011 1110
3:1100 1000 0000 0000
2:1101 0111 0000 0000
9:1101 1000 0000 0000
6:1111 0011 0000 0000
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
10:1001 0011 1001 0111
5:1010 0000 1110 0000
1:1011 0000 0011 0000
3:1100 0000 1101 0000
2:1101 0000 0111 0000
9:1101 0000 1101 0000
6:1111 0000 1001 0000
7:0000 0000 0000 1110
4:0100 0000 0000 1100
8:0101 0000 0000 1000
10:1001 0000 0000 0111
5:1010 0000 0000 1010
1:1011 0000 0000 1110
3:1100 0000 0000 1100
2:1101 0000 0000 0001
9:1101 0000 0000 1110
6:1111 0000 0000 0111
7:0000 0001 0011 0000
4:0000 0001 1101 0000
5:0000 0010 1110 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
2:0000 0111 0111 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
8:0000 1001 0111 0000
1:0000 1111 0011 0000
4:0000 0001 0000 1100
7:0000 0001 0000 1110
5:0000 0010 0000 1010
6:0000 0011 0000 0111
10:0000 0011 0000 0111
2:0000 0111 0000 0001
3:0000 1000 0000 1100
9:0000 1000 0000 1110
8:0000 1001 0000 1000
1:0000 1111 0000 1110
1:0000 0000 0011 1110
7:0000 0000 0011 1110
2:0000 0000 0111 0001
8:0000 0000 0111 1000
4:0000 0000 0111 1000
6:0000 0000 1001 0111
10:0000 0000 1001 0111
3:0000 0000 1101 1100
9:0000 0000 1101 1110
5:0000 0000 1110 1010
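The following sketch illustrates the block-wise masking idea under the assumption that every sketch is split into k equal blocks and any d of them may be masked; Python's sort again stands in for radix sort, and the `found` set is the naive O(n²)-memory duplicate check that the later slides replace.

from itertools import combinations

def block_mask_pairs(sketches, k, d, max_hamming):
    """For every way of masking d of the k blocks, sort on the remaining blocks,
    collect strings whose unmasked blocks match, and keep pairs whose true
    Hamming distance is at most max_hamming."""
    n = len(sketches)
    block_len = len(sketches[0]) // k
    blocks = [[s[b * block_len:(b + 1) * block_len] for b in range(k)] for s in sketches]
    found = set()  # naive duplicate check; later slides avoid the n^2 memory
    for masked in combinations(range(k), d):
        keep = [b for b in range(k) if b not in masked]
        key = lambda i: tuple(blocks[i][b] for b in keep)
        order = sorted(range(n), key=key)
        start = 0
        for pos in range(1, n + 1):
            if pos == n or key(order[pos]) != key(order[start]):
                for i, j in combinations(sorted(order[start:pos]), 2):
                    dist = sum(c1 != c2 for c1, c2 in zip(sketches[i], sketches[j]))
                    if dist <= max_hamming:
                        found.add((i, j))
                start = pos
    return sorted(found)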
Step 1. Perform radix sort on one block and detect the
equivalence classes
Step 2. Within each equivalence class, perform radix sort
on the next block (a recursive sketch follows the example)
7:0000 0001 0011 0000
4:0000 0001 1101 0000
5:0000 0010 1110 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
2:0000 0111 0111 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
8:0000 1001 0111 0000
1:0000 1111 0011 0000
7:0000 0001 0011 0000
4:0000 0001 1101 0000
5:0000 0010 1110 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
2:0000 0111 0111 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
8:0000 1001 0111 0000
1:0000 1111 0011 0000
7:0000 0001 0011 0000
4:0000 0001 1101 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
Recursive
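A recursive sketch of Steps 1-2, assuming `blocks[i]` holds the unmasked blocks of sketch i; dictionary grouping stands in for radix sort and the callback name is illustrative.

def sort_blocks_recursively(ids, blocks, depth, num_blocks, on_class):
    """Group the sketches (given by id) on block `depth`; recurse on each group
    with the next block until every block has been used, then hand each final
    equivalence class (size > 1) to `on_class`."""
    if depth == num_blocks:
        if len(ids) > 1:
            on_class(ids)
        return
    groups = {}  # dictionary grouping stands in for radix sort on one block
    for i in ids:
        groups.setdefault(blocks[i][depth], []).append(i)
    for members in groups.values():
        sort_blocks_recursively(members, blocks, depth + 1, num_blocks, on_class)

# Usage sketch: blocks[i] is the list of unmasked blocks of sketch i, e.g.
# sort_blocks_recursively(list(range(len(blocks))), blocks, 0, len(blocks[0]), print)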
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
10:1001 0011 1001 0111
5:1010 0010 1110 1010
1:1011 1111 0011 1110
3:1100 1000 0000 0000
2:1101 0111 0000 0000
9:1101 1000 0000 0000
6:1111 0011 0000 0000
2:1101 0111 0000 0000
9:1101 1000 0000 0000
2:1101 0111 0011 0000
9:1101 1000 0000 0000
2:1101 0111 0000 0001
9:1101 1000 0000 1110
7:0000 0001 0011 0000
4:0000 0001 1101 0000
5:0000 0010 1110 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
2:0000 0111 0111 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
8:0000 1001 0111 0000
1:0000 1111 0011 0000
7:0000 0001 0011 0000
4:0000 0001 1101 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
4:0000 0001 0011 1000
7:0000 0001 1101 1110
6:0000 0011 1001 0111
10:0000 0011 1001 0111
3:0000 1000 1101 1100
9:0000 1000 1101 1110
1:0000 0000 0011 1110
7:0000 0000 0011 1110
2:0000 0000 0111 0001
8:0000 0000 0111 1000
4:0000 0000 0111 1000
6:0000 0000 1001 0111
10:0000 0000 1001 0111
3:0000 0000 1101 1100
9:0000 0000 1101 1110
5:0000 0000 1110 1010
1:0000 0000 0011 1110
7:0000 0000 0011 1110
2:0000 0000 0111 0001
8:0000 0000 0111 1000
4:0000 0000 0111 1000
3:0000 0000 1101 1100
9:0000 0000 1101 1110
 All neighbor pairs can be
enumerated
7:0000 0001 0011 1110
4:0100 0001 1101 1100
8:0101 1001 0111 1000
10:1001 0011 1001 0111
5:1010 0010 1110 1010
1:1011 1111 0011 1110
3:1100 1000 0000 0000
2:1101 0111 0000 0000
9:1101 1000 0000 0000
6:1111 0011 0000 0000
2:1101 0111 0000 0000
9:1101 1000 0000 0000
2:1101 0111 0011 0000
9:1101 1000 0000 0000
2:1101 0111 0000 0001
9:1101 1000 0000 1110
7:0000 0001 0011 0000
4:0000 0001 1101 0000
5:0000 0010 1110 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
2:0000 0111 0111 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
8:0000 1001 0111 0000
1:0000 1111 0011 0000
7:0000 0001 0011 0000
4:0000 0001 1101 0000
6:0000 0011 1001 0000
10:0000 0011 1001 0000
3:0000 1000 1101 0000
9:0000 1000 1101 0000
4:0000 0001 0011 1000
7:0000 0001 1101 1110
6:0000 0011 1001 0111
10:0000 0011 1001 0111
3:0000 1000 1101 1100
9:0000 1000 1101 1110
1:0000 0000 0011 1110
7:0000 0000 0011 1110
2:0000 0000 0111 0001
8:0000 0000 0111 1000
4:0000 0000 0111 1000
6:0000 0000 1001 0111
10:0000 0000 1001 0111
3:0000 0000 1101 1100
9:0000 0000 1101 1110
5:0000 0000 1110 1010
1:0000 0000 0011 1110
7:0000 0000 0011 1110
2:0000 0000 0111 0001
8:0000 0000 0111 1000
4:0000 0000 0111 1000
3:0000 0000 1101 1100
9:0000 0000 1101 1110
 The same pair can be
detected in different block
combinations
Ex) (3,9), (6,10)
 A naïve duplicate check takes O(n²) memory
6:1111 0011 1001 0111
10:1001 0011 1001 0111
a b c d
6:1111 0011 1001 0111
10:1001 0011 1001 0111
Step1: Make a total order among
blocks from left to right
Step2: Make a total order among
block combinations
Step3: Take the minimum among
matching block combinations
a b c d
6:1111 0011 1001 0111
10:1001 0011 1001 0111
Combination 1 Combination 2
(a,b)<(a,c)<(a,d)
(a,b)<(a,c)<(a,d)
<(b,c)<(b,d)<(c,d)
Combination 3
a b c d
6:1111 0011 1001 0111
10:1001 0011 1001 0111
a b c d
6:1111 0011 1001 0111
10:1001 0011 1001 0111
If the number of processed blocks reaches k−d,
eliminate duplicate pairs and
calculate the Hamming distance;
otherwise call the function recursively on each
equivalence class
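A sketch of the minimum-combination rule from Steps 1-3 above, assuming block combinations are enumerated in a fixed left-to-right total order; the function and parameter names are illustrative.

from itertools import combinations

def is_minimum_combination(si, sj, k, d, current_keep, block_len):
    """Return True only if `current_keep` (the unmasked blocks of the current pass)
    is the first combination, in the fixed left-to-right order, on which every
    unmasked block of si and sj matches; later matching combinations are duplicates."""
    def blocks_match(keep):
        return all(si[b * block_len:(b + 1) * block_len] ==
                   sj[b * block_len:(b + 1) * block_len] for b in keep)
    for keep in combinations(range(k), k - d):  # combinations enumerated in a total order
        if blocks_match(keep):
            return keep == tuple(current_keep)  # report the pair only in the first one
    return False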
 Enumerates all neighbor pairs within a distance threshold
- (xi, xj), i < j, Δ(xi, xj) ≦ ε
 Basic idea
- Map vector data to sketches by LSH
- Enumerate all neighbor pairs by MSM
 SketchSort with cosine LSH
- Enumerates all neighbor pairs within a cosine
distance threshold ε
- (xi, xj), i < j, Δ(xi, xj) ≦ ε, where the cosine distance is
the angle Δ(xi, xj) = cos⁻¹( xiᵀxj / (‖xi‖ ‖xj‖) )
[Figure: SketchSort pipeline]
Input vector data
x1=(0.1,1.2,-0.9,2.3,…)
x2=(1.5,-0.1,-1.2,-1.2,…)
x3=(-1.6,1.9,0.5,-0.6,…)
…
→ Map to sketches by LSH (s1=1000101001..., s2=0110000101…, s3=0101001000…)
→ Enumerate matching block combinations by MSM
(e.g. s6:1111 0011 1001 0111 and s8:1001 0011 1001 0111)
→ Block-level duplication? (Yes → trash)
→ Check Hamming distance (>d → trash)
→ Chunk-level duplication? (Yes → trash)
→ Check cosine distance (>ε → trash, ≦ε → output)
 Basic idea
- Generate a random hyperplane through the origin
- Map each data point to ‘1’ if it lies above the
hyperplane, and to ‘0’ otherwise
 Repeat l times to obtain a sketch of length l
(e.g. 101100….)
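A minimal sketch of this cosine LSH, using NumPy random Gaussian hyperplanes; the function name, the fixed seed, and the example vectors (taken from the pipeline figure) are illustrative.

import numpy as np

def cosine_lsh_sketch(X, num_bits, seed=0):
    """Bit t of a point's sketch is 1 iff the point lies on the positive side of
    the t-th random hyperplane through the origin."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((X.shape[1], num_bits))  # one normal vector per bit
    bits = (X @ hyperplanes >= 0).astype(np.uint8)
    return ["".join(map(str, row)) for row in bits]

# Vectors from the pipeline figure (truncated to 4 dimensions for illustration).
X = np.array([[0.1, 1.2, -0.9, 2.3],
              [1.5, -0.1, -1.2, -1.2],
              [-1.6, 1.9, 0.5, -0.6]])
print(cosine_lsh_sketch(X, num_bits=16))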
 Basic idea: map vector data to sketches and apply MSM
 Applying MSM to the long sketches all at once does not work well
 Instead, divide long sketches into Q short sketches of length l
(chunks); a code sketch follows the figure
 Apply MSM to each chunk to obtain neighbor pairs w.r.t.
Hamming distance
1100101010010101 0101010101001010 101010101010011
0101010101010101 0101000101010101 010101010010100
1001010101011110 1010101010101010 101010101010100
1000001010101010 1010101010101010 101010010101011
1111001010111110 1010101011111010 010101010101010 ......
1010101010010101 0100111101010100 001111010001011
1000010000100001 0011111101000100 010010010001111
1111000001110101 0101001010101001 001000111100101
0001010010101001 0100101011100100 101001010001000
1111000100100010 0011010100010010 010010100010001
(each chunk has length l; MSM is applied to each chunk)
 Report neighbor pairs within a cosine distance
threshold ε
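A small sketch of the chunking step referenced above, assuming sketches of length Q·l; a naive all-pairs scan stands in for MSM inside each chunk, and all names are illustrative.

def split_into_chunks(sketch, chunk_len):
    """Cut one long sketch into consecutive chunks of `chunk_len` bits."""
    return [sketch[q:q + chunk_len] for q in range(0, len(sketch), chunk_len)]

def candidate_pairs_per_chunk(sketches, chunk_len, max_hamming):
    """Candidate pairs reported by each chunk (a naive scan stands in for MSM)."""
    chunked = [split_into_chunks(s, chunk_len) for s in sketches]
    candidates = []
    for q in range(len(chunked[0])):
        for i in range(len(sketches)):
            for j in range(i + 1, len(sketches)):
                dist = sum(a != b for a, b in zip(chunked[i][q], chunked[j][q]))
                if dist <= max_hamming:
                    candidates.append((i, j, q))  # pair (i, j) detected in chunk q
    return candidates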
 A neighbor pair of sketches can be detected in
several chunks within Hamming distance d
Si: 1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
Sj: 0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …
HamDist(si¹,sj¹)>d, HamDist(si²,sj²)≦d, HamDist(si³,sj³)>d,
HamDist(si⁴,sj⁴)>d, HamDist(si⁵,sj⁵)≦d → Duplication!!
 The same pair is output several times
(duplication)
Step 1: Order the chunks from left to right.
1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …
1 2 3 4 5
Step 2: When a pair is detected in a chunk (e.g. HamDist(si⁵,sj⁵)≦d), check whether any
chunk to its left is also within Hamming distance d:
HamDist(si⁴,sj⁴)>d? HamDist(si³,sj³)>d? HamDist(si²,sj²)>d? HamDist(si¹,sj¹)>d?
- If such a chunk is found, trash the pair;
otherwise check the cosine distance
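A sketch of this chunk-level duplicate check, assuming `chunks_i` and `chunks_j` hold the Q chunks of the two sketches; the function name is illustrative.

def is_first_reporting_chunk(chunks_i, chunks_j, q, max_hamming):
    """Keep a pair detected in chunk q only if no chunk to the left of q is also
    within Hamming distance max_hamming; otherwise it is a duplicate."""
    for left in range(q):
        dist = sum(a != b for a, b in zip(chunks_i[left], chunks_j[left]))
        if dist <= max_hamming:
            return False  # an earlier chunk already reports this pair -> trash
    return True           # q is the leftmost matching chunk -> go on to the cosine check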
 Call function for
each chunk
 Check duplication
four times
 Divide sketches into
equivalence classes
 Call function recursively
 True edge set E*, our result set E
 Type-I error (false positive): a non-neighbor pair has
a Hamming distance within d in at least one chunk
 Type-II error (false negative): a neighbor pair has a
Hamming distance larger than d in all chunks
 Type-II errors are the more critical ones
- Type-I errors are filtered out by the distance calculations
 The missing edge ratio (type-II error) is bounded as shown below,
where p is an upper bound of the non-collision
probability of neighbors
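The bound itself appeared as an image on the original slide; assuming the Q chunks are independent and p upper-bounds the per-chunk non-collision probability of a neighbor pair, it plausibly takes the following form (a reconstruction, not the slide's exact formula):

% Reconstructed bound (assumption: chunks are independent; p bounds the
% probability that one chunk of a neighbor pair has Hamming distance > d):
\[
  \frac{|E^{*} \setminus E|}{|E^{*}|} \;\le\; p^{Q}
\]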
 Motivation
- Large-scale data
- Need for All Pairs Similarity Search Methods
- Single Sorting Method and its drawbacks
 Method
- Multiple Sorting Method
- Cosine Locality Sensitive Hashing
- SketchSort
 Experiments
- Comparison with other state-of-the-art methods
- Large-scale image datasets
 Two image datasets
- MNIST (60,000 points, 784 dimensions)
- Tiny Image (100,000 points, 960 dimensions)
 The missing edge ratio is used as the evaluation measure
 Cosine distance threshold: 0.15π
 Length of each chunk: 32 bits
 (Hamming distance, number of blocks) set to
(2,5) and (3,6)
 Number of chunks varies over 2, 4, 6, …, 50
 Our method is compared with the Lanczos bisection method
(JMLR, 2009)
 K-nearest neighbor graph construction by
SketchSort
- Keep the k nearest neighbor pairs in a priority queue
(see the sketch below)
 Compare SketchSort to
- Cover Tree (Beygelzimer et al., ICML 2006)
- AllKNN (Ram et al., NIPS 2009)
- Lanczos bisection (JMLR, 2009)
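A sketch of the priority-queue bookkeeping mentioned above: as SketchSort streams neighbor pairs, every point keeps its k closest neighbors in a bounded heap; the `heapq` usage and all names are illustrative.

import heapq

def knn_from_pairs(pair_stream, k):
    """Keep, for every point, the k nearest neighbors seen in the pair stream.
    pair_stream yields (i, j, dist); each point's heap stores (-dist, neighbor)
    so the current worst neighbor sits on top and can be evicted."""
    heaps = {}
    for i, j, dist in pair_stream:
        for a, b in ((i, j), (j, i)):
            h = heaps.setdefault(a, [])
            if len(h) < k:
                heapq.heappush(h, (-dist, b))
            elif -h[0][0] > dist:              # current worst is farther than the new pair
                heapq.heapreplace(h, (-dist, b))
    return {a: sorted((-d, b) for d, b in h) for a, h in heaps.items()}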
 Parameters are set so as to keep the missing edge ratio
no more than 1.0×10⁻⁶
 Similar pairs are detected nearly exactly
 Takes only 4.3 hours for 1.6 million images
[Table: results for cosine distance thresholds 0.05π, 0.1π, 0.15π]
 Fast all pairs similarity search method
 Applicable to large-scale vector data
 Extensible to Euclidean distance (Raginsky, 10) and
Jaccard coefficients (Broder, 00)
 Various applications
 Software
- http://code.google.com/p/sketchsort/