# The Post Office Problem

A talk I presented at Nashville Hack Day 2012 concerning using multiple random low-dimensional projections of high-dimensional data to optimize an approximate nearest neighbor search



1. The Post Office Problem: k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma
2. Who am I? Jeremy Holland, senior lead developer at Centresource. Math and algorithms nerd. @awebneck, github.com/awebneck, freenode: awebneck (you get the idea)
3. If you like the talk... I like scotch. Just putting it out there.
4. What is the Post Office Problem? Don Knuth, professional CS badass, TAOCP, vol. 3. Otherwise known as "nearest neighbor search." Let's say you've...
5. Just moved to Denmark!
6. But you need to mail a letter! Which post office do you go to?
7. Finding free images of post offices is hard, so... we'll just reduce it to this: q
8. Naive implementation: calculate the distance to all points and keep the smallest.

```
min = INFINITY
P = <points to be searched>
K = <dimensionality of points, e.g. 2>
q = <query point>
best = nil
for p in P do
  dimDistSum = 0
  for k in 0...K do
    dimDistSum += (q[k] - p[k])**2
  dist = dimDistSum.sqrt
  if dist < min
    min = dist
    best = p
return best
```
9. With a little preprocessing... But that takes time! Can we do better? You bet! The k-d tree: a binary tree (each node has at most two children) where each node represents a single point in the set to be searched.
10. Each node looks like... Domain: the vector describing the point (i.e. [p[0], p[1], ..., p[k-1]]). Range: some identifying characteristic (e.g. a PK in a database). Split: a chosen dimension, 0 ≤ split < k. Left: the left child (left.domain[split] < self.domain[split]). Right: the right child (right.domain[split] ≥ self.domain[split])
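The node shape described above can be sketched as a small Python class. The field names follow the slide (`range` becomes `range_` only because `range` is a Python builtin); this is an illustrative sketch, not code from the talk.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class KDNode:
    domain: List[float]               # the point's coordinate vector [p[0], ..., p[k-1]]
    range_: Any                       # identifying payload, e.g. a database primary key
    split: int                        # dimension this node partitions on, 0 <= split < k
    left: Optional["KDNode"] = None   # left.domain[split]  <  self.domain[split]
    right: Optional["KDNode"] = None  # right.domain[split] >= self.domain[split]
```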
11. Let's build a k-d tree! Point 1: [20,10]
12. Let's build a k-d tree! Let's split on the x axis
13. Let's build a k-d tree! Add a new point: [10,5]
14. Let's build a k-d tree! The new point is the left child of the first point
15. Let's build a k-d tree! Let's split it on the y axis
16. Let's build a k-d tree! And add a 3rd point: [25,3]
17. Let's build a k-d tree! The new point is the right child of the first point
18. Let's build a k-d tree! So on and so forth...
19. Let's build a k-d tree! Giving you a tree:
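The construction walked through above can be sketched in a few lines of Python. This version bakes in the balancing advice from the "Stuff to watch out for" slide (median point per node, axes cycled by depth); the dict-based node shape is an assumption for brevity, not the talk's code.

```python
def build_kdtree(points, depth=0):
    """Recursively build a balanced k-d tree.

    Cycles the split axis by depth (root splits on 0, level 1 on 1, ...)
    and picks the median point along that axis as each node."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "split": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }
```

Built on the three points from the slides, [20,10] ends up at the root with [10,5] on the left and [25,3] on the right, matching the walkthrough.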
20. How do we search it? Step 1: Find the best bin (where the query point would otherwise be inserted)
21. How do we search it? NOTE: There is no node for this bin, just the space where a node would be if it existed!
22. How do we search it? Step 2: Make the current leaf node the current "best guess"
23. How do we search it? ...and set the "best guess radius" to be the distance between the query and that point
24. How do we search it? Step 3: Back up the tree 1 node
25. How do we search it? If the distance between the query and the new node is less than the best guess radius...
26. How do we search it? ...then set the best guess radius to the new distance, and make the current node the best guess
27. How do we search it? Step 4: If the hypersphere described by the best guess radius crosses the current split... Oh nooooh!
28. How do we search it? And the current node has a child on the other side... Oh snap!
29. How do we search it? ...then make that node the current node, and repeat:
30. How do we search it? Here, the distance is not less than the best guess radius...
31. How do we search it? ...and the hypersphere neither crosses the split... Whew, missed it!
32. How do we search it? ...nor does the current node have any children... Whew, missed it!
33. How do we search it? So we can eliminate it and back up the tree again!
34. How do we search it? We've already compared this node, so let's keep going back up the tree
35. How do we search it? Again, the distance is not less than the best guess radius, and there is no crossing; back up again!
36. How do we search it? ...and again...
37. How do we search it? All the way back to the root!
38. How do we search it? And you have your nearest neighbor, in good-case running time! I'm the answer!
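The descend-then-backtrack procedure above can be sketched as a recursive Python function over dict-shaped nodes (the `point`/`split`/`left`/`right` field names are assumptions for the sketch). The pruning test, crossing the split only when the best-guess hypersphere reaches the splitting plane, is the heart of steps 3 and 4:

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(node, query, best=None):
    """Nearest-neighbor search in a k-d tree.

    Descends toward the query's bin first; while unwinding, updates
    the best guess and visits the far side of a split only when the
    best-guess radius reaches the splitting plane."""
    if node is None:
        return best
    point = node["point"]
    if best is None or dist(point, query) < dist(best, query):
        best = point
    axis = node["split"]
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < dist(best, query):  # hypersphere crosses the split
        best = nearest(far, query, best)
    return best
```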
39. But that was a pretty good case... We barely had to backtrack at all; the best case is O(log n). The worst case (lots of backtracking, examining almost every node) can get up to O(n). The amount of backtracking grows rapidly with k! If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search. As k becomes large, the benefits of this over a naive implementation virtually disappear!
40. The Curse of Dimensionality: Curse you, dimensionality! High-dimensional vector spaces are darned hard to search! Why? Too many dimensions! Why are there so many dimensions!?! What can we do about it? Get rid of the extra weight! Enter Messrs. Johnson and Lindenstrauss
41. It turns out... your vectors have a high dimension, but what you need is not absolute distance and precise location, just the relative distance between points. Relative distance can be largely preserved by a lower-dimensional space: reduce k dimensions to kproj dimensions, kproj << k
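A random projection in the Johnson-Lindenstrauss spirit needs nothing beyond the standard library. The Gaussian entries and the 1/sqrt(kproj) scaling are the textbook choice, assumed here rather than taken from the slides:

```python
import math
import random


def random_projection(k, k_proj, seed=None):
    """Return a k_proj x k matrix of Gaussian entries scaled by
    1/sqrt(k_proj), so projected pairwise distances are preserved
    in expectation (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k_proj)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(k_proj)]


def project(matrix, point):
    """Multiply the projection matrix by a k-dimensional point,
    yielding its k_proj-dimensional image."""
    return [sum(m * x for m, x in zip(row, point)) for row in matrix]
```

For the 256-to-15 reduction mentioned later in the deck, that is `random_projection(256, 15)` applied to every point once, up front.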
42. Example: 2d to 1d [diagram: points in the plane with pairwise distances labeled]
43. Example: 2d to 1d, 1st attempt [diagram: the same points with a projection plane drawn through them]
44. Example: 2d to 1d, 1st attempt [diagram: the finished 1-d projection]
45. Example: 2d to 1d, 2nd attempt [diagram: a different projection plane]
46. Example: 2d to 1d, 2nd attempt [diagram: the finished 1-d projection]
47. Example: 2d to 1d, 3rd attempt [diagram: a third projection plane]
48. Example: 2d to 1d, 3rd attempt [diagram: the finished 1-d projection]
49. It turns out... Relative distance can be largely, but not completely, preserved by a lower-dimensional space. Every projection will have errors. How do you choose the one with the fewest? Trick question: let fate decide!
50. Multiple random projections: Choose the projections randomly, and use several of them. Exchange cost in resources for cost in accuracy: more projections = greater resource cost = greater accuracy; fewer projections = lesser resource cost = lesser accuracy. Trivially parallelizable. Learn to be happy with "good enough."
51. Multiple random projections: Get the nearest neighbor from each projection, then run a naive nearest-neighbor search on the results thereof.

```
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <q projected onto the same plane as p>
  nns << <nearest neighbor to pq within projection p>
return <naive nearest neighbor to q among nns>
```

Et voilà!
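Putting the pieces together, here is a minimal end-to-end sketch of that scheme, using 1-d projections and one candidate per projection for brevity (the experiments cited on the next slide take multiple nearest neighbors from each projection, which raises accuracy):

```python
import math
import random


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def approx_nn(points, query, n_proj=5, seed=1):
    """Approximate nearest neighbor via multiple random 1-d projections.

    Each projection nominates the point whose projected value lies
    closest to the projected query; a naive nearest-neighbor pass over
    that small candidate set picks the final answer."""
    rng = random.Random(seed)
    k = len(query)
    candidates = []
    for _ in range(n_proj):
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]  # random direction
        qp = sum(a * b for a, b in zip(v, query))
        nearest_in_proj = min(
            points,
            key=lambda p: abs(sum(a * b for a, b in zip(v, p)) - qp),
        )
        candidates.append(nearest_in_proj)
    # naive nearest-neighbor search over the candidates
    return min(candidates, key=lambda p: euclid(p, query))
```

In production each projection would be searched with its own prebuilt k-d tree rather than the linear scan shown here; the scan just keeps the sketch short.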
52. Multiple random projections: Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and d is reduced from 256 to 15, at approximately 30% of the calculation (see credits). Additional experiments yielded similar results, as did my own. That's pretty darn-tootin' good!
53. Stuff to watch out for: Balancing is vitally important. Assuming a uniform distribution of points, careful attention must be paid to the selection of nodes (pick the node with the median coordinate along the split axis). Cycle through the axes for each level of the tree: the root should split on dimension 0, level 1 on 1, level 2 on 2, etc.
54. Stuff to watch out for: Building the trees still takes some time. Building the projections is effectively matrix multiplication, with time in O(n^2.807) (Strassen's algorithm). Building the (balanced) trees from the projections takes time in approximately O(n log n). Solution: build the trees ahead of time and store them for later querying (i.e., index those bad boys!)
55. 55. Thanks! Credits:  Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo  K-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p  Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN  Research Fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E