Sketch sort ochadai20101015-public


  1. 1. Seminar @ Ochanomizu University, October 14, 2010 (Thu.). Yasuo Tabei (JST Minato ERATO Project). Joint work with Takeaki Uno (NII), Masashi Sugiyama (TITECH), Koji Tsuda (AIST)
  2. 2. • Motivation - Large-scale data - Need for an all-pairs similarity search method - Single sorting method and its drawbacks • Method - Multiple sorting method - Locality sensitive hashing - SketchSort • Experiments - Comparison with other state-of-the-art methods - Use of large-scale image datasets
  3. 3. • Image - 80 million tiny images (Torralba et al., 2008) - size: 32×32 pixels • Chemical compounds - NCBI PubChem - 28 million chemical compounds • Genome sequences - NCBI Sequence Read Archive - Large-scale genome sequences from various organisms
  4. 4. Image → SIFT vector x=(0.3, -0.3, 0.5, 1.2, …); chemical compound → fingerprint x=(1, 0, 1, 0, 0, 0, 1, …); text, protein, DNA/RNA, etc.
  5. 5. • Mapping a vector to a binary string (sketch) - Conserves the distance in the original space - x=(0.3, 0.1, 0.5, 0.6, 0.7, 1.2, -0.2, …) is mapped to s=1010010001110001010… • Advantages - Can keep giga-scale data in main memory - Accelerates various algorithms
  6. 6. • Finding all neighbor pairs from vector data - Given a set of data points - Find all pairs within a distance ε: $\{(x_i, x_j) \ \text{s.t.}\ \Delta(x_i, x_j) \le \epsilon\}$
  7. 7. • Can build a neighborhood graph - Vertex: a data point - Edge: a neighbor pair • Applications: semi-supervised learning, spectral clustering, ROI detection in images, retrieval of protein sequences, etc. (Figure: neighborhood graph with radius ε.)
  8. 8. • Finding neighbor pairs by sorting • Map vector data to sketches (a) Input: 1:101111, 2:110101, 3:110010, 4:010000, 5:101000, 6:111100, 7:000000, 8:010110, 9:110110, 10:100100 (b) Sort: 7:000000, 4:010000, 8:010110, 10:100100, 5:101000, 1:101111, 3:110010, 2:110101, 9:110110, 6:111100 (c) Scan neighbors: scan the sorted list with a window of width w
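The single sorting method on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the Hamming threshold `d`, and the use of Python's built-in sort are assumptions made for the example. The ten 6-bit sketches are the ones shown on the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_sort_neighbors(sketches, w, d):
    """Sort the sketches, then compare each one only with the next w entries.

    sketches: dict mapping an id to a bit string (hypothetical input format).
    Returns the set of id pairs (i, j), i < j, with Hamming distance <= d
    that fall inside the scanning window.
    """
    order = sorted(sketches, key=lambda i: sketches[i])
    found = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + w]:
            if hamming(sketches[i], sketches[j]) <= d:
                found.add((min(i, j), max(i, j)))
    return found


# The ten sketches from the slide.
SKETCHES = {1: "101111", 2: "110101", 3: "110010", 4: "010000", 5: "101000",
            6: "111100", 7: "000000", 8: "010110", 9: "110110", 10: "100100"}
```

With `w=2` and `d=1`, the pair (3, 9) is found, but the pair (8, 9), which also differs in a single bit, is missed because the differing bit is the leading one and the two sketches end up far apart after sorting. This is the kind of missing neighbor that the next slide refers to.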
  9. 9. • Needs a large number of distance calculations to achieve reasonable accuracy • Cannot derive an analytical estimate of the fraction of missing neighbors
  10. 10. • Motivation - Large-scale data - Need for an all-pairs similarity search method - Single sorting method and its drawbacks • Method - Multiple sorting method - Locality sensitive hashing - SketchSort • Experiments - Comparison with other state-of-the-art methods - Use of large-scale image datasets
  11. 11. • Input: set of fixed-length strings S={s1,…,sn} • Output: all pairs of strings within a Hamming distance d • By applying radix sort, enumerate all pairs in O(n+m) - n: number of strings, m: number of output pairs • Introduce a block-wise masking technique for acceleration
  12. 12. • Sort strings by radix sort and divide them into equivalence classes: O(n) • Draw edges between all strings in an equivalence class: O(m) • Computational complexity: O(n+m) (Figure: eight names, ALICE ×2, BOBBY, CHRIS, DAVID ×3, EMILY, are sorted and grouped into equivalence classes.)
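A minimal sketch of the sort-then-group step, under stated assumptions: it uses Python's built-in comparison sort rather than radix sort (so it does not reach the O(n) sorting cost quoted on the slide), the function name is hypothetical, and the input order of the names is a guess because the figure is only partly preserved.

```python
from itertools import combinations, groupby


def equal_pairs(strings):
    """Return index pairs (i, j), i < j, whose strings are identical.

    Sorting puts identical strings next to each other, so each run of
    equal strings is one equivalence class.
    """
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    pairs = []
    for _, run in groupby(order, key=lambda i: strings[i]):
        members = list(run)  # one equivalence class, indices in input order
        pairs.extend(combinations(members, 2))
    return pairs


# Assumed input order; only the multiset of names is taken from the slide.
NAMES = ["EMILY", "ALICE", "DAVID", "CHRIS", "BOBBY", "DAVID", "DAVID", "ALICE"]
```

Here `equal_pairs(NAMES)` yields one pair for the two ALICE entries and three pairs for the three DAVID entries, which are the edges drawn inside each equivalence class.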
  13. 13. • Mask d characters in all possible ways • Perform radix sort $\binom{l}{d}$ times • Linear time in the number of strings • Time exponential in d, polynomial in the string length l • Ex) d=2 on ten 16-bit sketches: 7:0000 0001 0011 1110, 4:0100 0001 1101 1100, 8:0101 1001 0111 1000, 10:1001 0011 1001 0111, 5:1010 0010 1110 1010, 1:1011 1111 0011 1110, 3:1100 1000 1101 1100, 2:1101 0111 0111 0001, 9:1101 1000 1101 1110, 6:1111 0011 1001 0111 (figure: the sketches before and after sorting)
  14. 14. • Mask d blocks in all possible ways • Reduces the number of sorting operations • Non-neighbors might be detected • Filter them out by calculating the actual Hamming distance • Ex) d=2 (figure: ten 16-bit sketches, split into four 4-bit blocks, sorted under each of the six ways of masking two blocks)
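The block-wise masking idea can be sketched as below. This is an illustration under assumptions, not the original implementation: it groups sketches with a hash table instead of radix sort, removes duplicates with a Python set, and assumes the sketch length is divisible by the number of blocks `k`. The sketches are the ten 16-bit examples from the slides.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def split_blocks(s, k):
    """Split a bit string into k equal-sized blocks."""
    size = len(s) // k
    return [s[b * size:(b + 1) * size] for b in range(k)]


def msm_neighbors(sketches, k, d):
    """All id pairs (i, j), i < j, with Hamming distance <= d.

    If two sketches differ in at most d bits, at most d blocks differ, so
    they agree on all unmasked blocks for at least one choice of d masked
    blocks. Candidates are verified with the actual Hamming distance.
    """
    blocks = {i: split_blocks(s, k) for i, s in sketches.items()}
    neighbors = set()
    for masked in combinations(range(k), d):
        kept = [b for b in range(k) if b not in masked]
        classes = defaultdict(list)  # equivalence classes on the kept blocks
        for i, bl in blocks.items():
            classes[tuple(bl[b] for b in kept)].append(i)
        for members in classes.values():
            for i, j in combinations(sorted(members), 2):
                if hamming(sketches[i], sketches[j]) <= d:  # drop non-neighbors
                    neighbors.add((i, j))
    return neighbors


SKETCHES = {
    7: "0000000100111110", 4: "0100000111011100", 8: "0101100101111000",
    10: "1001001110010111", 5: "1010001011101010", 1: "1011111100111110",
    3: "1100100011011100", 2: "1101011101110001", 9: "1101100011011110",
    6: "1111001110010111",
}
```

Calling `msm_neighbors(SKETCHES, k=4, d=2)` performs six groupings, one per way of masking two of the four blocks, instead of one sort per way of masking two of the sixteen characters.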
  15. 15. Step 1. Perform radix sort on one block and detect equivalence classes. Step 2. For each equivalence class, perform radix sort on the next block. Recurse. (Figure: the example sketches with two blocks masked, sorted block by block into successively finer equivalence classes.)
  16. 16. • All neighbor pairs can be enumerated (figure: equivalence classes found for the example sketches under each block combination)
  17. 17. • The same pair may be detected in different block combinations. Ex) (3,9), (6,10) • A naive method takes $n^2$ memory (figure: equivalence classes found for the example sketches under each block combination)
  18. 18. Step 1: Make a total order among blocks from left to right (1, 2, 3, 4). Step 2: Make a total order among block combinations: (1,2)<(1,3)<(1,4)<(2,3)<(2,4)<(3,4). Step 3: Take the minimum among matched block combinations. Ex) 6:1111 0011 1001 0111 and 10:1001 0011 1001 0111 are matched in combination 1 = (1,2) and in combination 2 = (1,4); since (1,2) < (1,4), combination 1 is taken.
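One way to read the three steps above is that a pair is reported only under the smallest block combination in which it matches, so no table of already-reported pairs is needed. The sketch below assumes that a combination lists the masked blocks (numbered from 1, as in the slide's example) and that a pair matches under a combination when every block that differs is masked. The function names are hypothetical.

```python
from itertools import combinations


def split_blocks(s, k):
    """Split a bit string into k equal-sized blocks."""
    size = len(s) // k
    return [s[b * size:(b + 1) * size] for b in range(k)]


def minimum_matched_combination(s_i, s_j, k, d):
    """Smallest combination of d masked blocks under which the pair matches.

    combinations() yields tuples in lexicographic order, so the first hit
    is the minimum. Returns None if more than d blocks differ.
    """
    blocks_i, blocks_j = split_blocks(s_i, k), split_blocks(s_j, k)
    differing = {b + 1 for b in range(k) if blocks_i[b] != blocks_j[b]}
    for masked in combinations(range(1, k + 1), d):
        if differing <= set(masked):
            return masked
    return None


def should_report(s_i, s_j, k, d, masked):
    """True only for the minimum combination, so each pair is reported once."""
    return minimum_matched_combination(s_i, s_j, k, d) == tuple(masked)


S6 = "1111001110010111"   # sketch 6 from the slide
S10 = "1001001110010111"  # sketch 10 from the slide
```

For sketches 6 and 10, which differ only in block 1, the pair matches under (1,2), (1,3), and (1,4); `should_report` is true only for (1,2), in line with the slide's example.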
  19. 19. (Pseudocode annotations) If the number of blocks is k-d: eliminate duplicate pairs and calculate the Hamming distance. Call the function on the equivalence classes.
  20. 20. • Enumerates all neighbor pairs within a distance ε - $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) \le \epsilon$ • Basic idea - Map vector data to sketches by LSH - Enumerate all neighbor pairs by MSM (multiple sorting method) • SketchSort with cosine LSH - Enumerate all neighbor pairs within a cosine distance threshold ε - $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) = 1 - \frac{x_i^T x_j}{\|x_i\| \|x_j\|} \le \epsilon$ • Applicable to Euclidean distance (Raginsky, 10), Jaccard coefficients
  21. 21. • Basic idea • Generate a random hyperplane centered at 0 • Map a data point to '1' if it is above the hyperplane, or else '0' • Repeat l times (figure: data points on either side of the hyperplanes, labeled with the bits of their length-l sketch)
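The random-hyperplane mapping can be sketched in a few lines. Assumptions for this example: each hyperplane passes through the origin and is represented by a normal vector with independent standard Gaussian entries (a common choice, not stated on the slide), "above" means a non-negative dot product with that normal, and the function names are hypothetical.

```python
import random


def make_hyperplanes(dim, length, seed=0):
    """Draw `length` random normal vectors, one per sketch bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


def sketch(x, hyperplanes):
    """Map vector x to a bit string: '1' if x lies on the non-negative side
    of a hyperplane, '0' otherwise."""
    return "".join(
        "1" if sum(w * v for w, v in zip(h, x)) >= 0 else "0"
        for h in hyperplanes
    )
```

Because only the sign of each dot product is kept, scaling a vector by a positive constant leaves its sketch unchanged, which is why this mapping suits a cosine-based distance.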
  22. 22. • Basic idea: map vector data to sketches and apply MSM • Not good: create long sketches and apply MSM at once • Divide long sketches into Q short sketches of length l (chunks) • Apply MSM to each chunk, obtain neighbor pairs w.r.t. Hamming distance • Report neighbor pairs within a cosine distance threshold ε (figure: each long sketch divided into chunks of length l, with MSM applied to every chunk)
  23. 23. • A neighbor pair of sketches can be detected in several chunks within Hamming distance d. Ex) for sketches s_i and s_j: HamDist(si1,sj1)>d, HamDist(si2,sj2)≤d, HamDist(si3,sj3)>d, HamDist(si4,sj4)>d, HamDist(si5,sj5)≤d, so the pair is detected in chunks 2 and 5 (duplication) • The same pair is output several times (duplication)
  24. 24. Step 1: Order chunks from left to right (1, 2, 3, 4, 5, …). Step 2: For a pair detected in a chunk, check whether any chunk to its left is also within Hamming distance d. Ex) for a pair detected in chunk 5, HamDist(si5,sj5)≤d: check HamDist(si1,sj1)>d?, HamDist(si2,sj2)>d?, HamDist(si3,sj3)>d?, HamDist(si4,sj4)>d? - If such a chunk is found, trash the pair, or else check the cosine distance
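The chunk-level check in Step 2 can be sketched as follows. The function names are hypothetical, chunks are indexed from 0 here, and the 4-bit chunks are made-up toy data chosen so that the pair is within distance d=1 in the second and fifth chunks, mirroring the pattern in the slide's example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_chunk_level_duplicate(chunks_i, chunks_j, q, d):
    """True if some chunk to the left of chunk q is already within distance d.

    In that case the pair was (or will be) handled when that earlier chunk
    is processed, so it is trashed at chunk q.
    """
    return any(hamming(chunks_i[p], chunks_j[p]) <= d for p in range(q))


# Toy data (not from the slides): within d=1 in chunks 1 and 4 (0-indexed).
CHUNKS_I = ["1100", "0101", "1010", "1111", "0110"]
CHUNKS_J = ["0011", "0100", "0101", "0000", "0111"]
```

With this data the pair survives the check at chunk 1, where it would go on to the cosine distance check, and is trashed at chunk 4.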
  25. 25. Flowchart: (1) Block-level duplication? Yes → trash; No → (2) Check Hamming distance: >d → trash; ≤d → (3) Chunk-level duplication? Yes → trash; No → (4) Check cosine distance: >ε → trash; ≤ε → output
  26. 26. • Call function for each chunk • Check duplication four times • Divide sketches into equivalence classes • Call function recursively
  27. 27. • True edges E*, our results E • Type-I error (false positive): a non-neighbor pair has a Hamming distance within d in at least one chunk • Type-II error (false negative): a neighbor pair has a Hamming distance larger than d in all chunks
  28. 28. • Basically, the type-II error is more crucial - type-I errors are filtered out by distance calculations • The missing edge ratio (type-II error) is bounded as [formula not preserved in this transcript], where p is an upper bound of the non-collision probability of neighbors
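The slide's bound is not preserved in this transcript, so the following is only a sketch of one bound that follows from the type-II error definition under an added assumption: the bits of a neighbor pair's sketches differ independently, each with probability at most p. A neighbor pair is then missed only if the Hamming distance exceeds d in every one of the Q chunks of length l, which happens with probability at most $(1 - \sum_{k=0}^{d} \binom{l}{k} p^k (1-p)^{l-k})^Q$. The function and parameter names are hypothetical.

```python
from math import comb


def missing_edge_bound(p, length, d, num_chunks):
    """Upper bound on the chance that a neighbor pair is missed in all chunks.

    Assumes independent bits with per-bit non-collision probability at most p.
    `hit` is the probability that one chunk of `length` bits has Hamming
    distance <= d; the pair is missed only if every chunk fails.
    """
    hit = sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(d + 1)
    )
    return (1 - hit) ** num_chunks
```

Under this assumption the bound shrinks geometrically as the number of chunks grows, which is consistent with the experiments varying the number of chunks while using the missing edge ratio as the evaluation measure.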
  29. 29. • Motivation - Large-scale data - Need for an all-pairs similarity search method - Single sorting method and its drawbacks • Method - Multiple sorting method - Cosine locality sensitive hashing - SketchSort • Experiments - Comparison with other state-of-the-art methods - Use of large-scale image datasets
  30. 30. • Two image datasets - MNIST (60,000 data points, 784 dimensions) - Tiny Image (100,000 data points, 960 dimensions) • Use the missing edge ratio as an evaluation measure • Set the cosine distance threshold to 0.15π • Set the length of each chunk to 32 bits • Hamming distance and number of blocks are set to (2,5) and (3,6) • Vary the number of chunks over 2, 4, 6, …, 50 • Compare our method to the Lanczos bisection method (JMLR, 2009)
  31. 31. • K-nearest neighbor graph construction by SketchSort - Keep k-nearest neighbor pairs by priority queue • Compare SketchSort to - Cover Tree (Beygelzimer et al., ICML 2006) - AllKNN (Ram et al., NIPS 2009) - Lanczos-bisection (JMLR, 2009)
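Keeping the k nearest neighbors of each point with a priority queue might look like the sketch below. The details are assumptions rather than the authors' implementation: neighbor pairs arrive as `(i, j, distance)` triples, and each point holds a bounded max-heap (Python's `heapq` is a min-heap, so distances are negated).

```python
import heapq
from collections import defaultdict


def knn_from_pairs(pairs, k):
    """Keep, for every point, its k closest partners among the given pairs.

    pairs: iterable of (i, j, distance) triples (hypothetical input format).
    Returns a dict: point -> list of (distance, neighbor), closest first.
    """
    heaps = defaultdict(list)  # point -> heap of (-distance, neighbor)
    for i, j, dist in pairs:
        for a, b in ((i, j), (j, i)):
            heap = heaps[a]
            if len(heap) < k:
                heapq.heappush(heap, (-dist, b))
            elif -heap[0][0] > dist:
                # The current farthest neighbor is farther than b: replace it.
                heapq.heapreplace(heap, (-dist, b))
    return {a: sorted((-nd, b) for nd, b in heap) for a, heap in heaps.items()}
```

Each point's heap never grows beyond k entries, so memory stays proportional to k times the number of points regardless of how many neighbor pairs are enumerated.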
  32. 32. • Set parameters so as to keep the missing edge ratio no more than $1.0 \times 10^{-6}$ • Detects similar pairs nearly exactly • Takes only 4.3 hours for 1.6 million images (Figure: results for thresholds 0.05π, 0.1π, 0.15π.)
  33. 33. • Fast all-pairs similarity search method • Applicable to large-scale vector data • Various applications • Software - http://code.google.com/p/sketchsort/
