Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers a library called Annoy built my me that that helps you do (approximate) nearest neighbor queries in high dimensional spaces. We're going through vector models, how to measure similarity, and why nearest neighbor queries are useful.
11. …Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in a small dimensional space
12. Deep learning for food
• Deep model trained on a GPU on 6M random pics
downloaded from Yelp
156x156x32
154x154x32
152x152x32
76x76x64
74x74x64
72x72x64
36x36x128
34x34x128
32x32x128
16x16x256
14x14x256
12x12x256
6x6x512
4x4x512
2x2x512
2048
2048
128
1244
3x3 convolutions
2x2 maxpool
fully
connected
with dropout
bottleneck
layer
13. Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an
item vector
3. Use cosine distance in the reduced space
20. Nearest neighbors the
brute force way
• we can always do an exhaustive search to find the
nearest neighbors
• imagine MySQL doing a linear scan for every
query…
21. Using word2vec’s brute
force search
$ time echo -e "chinese rivernEXITn" | ./distance GoogleNews-
vectors-negative300.bin
!
Qiantang_River 0.597229
Yangtse 0.587990
Yangtze_River 0.576738
lake 0.567611
rivers 0.567264
creek 0.567135
Mekong_river 0.550916
Xiangjiang_River 0.550451
Beas_river 0.549198
Minjiang_River 0.548721
real 2m34.346s
user 1m36.235s
sys 0m16.362s
38. Problemo
• The point that’s the closest isn’t necessarily in the
same leaf of the binary tree
• Two points that are really close may end up on
different sides of a split
• Solution: go to both sides of a split if it’s close
39.
40.
41.
42. Trick 1: Priority queue
• Traverse the tree using a priority queue
• sort by min(margin) for the path from the root
43. Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at
the same time
44.
45. heap + forest = best
• Since we use a priority queue, we will dive down
the best splits with the biggest distance
• More trees always helps!
• Only constraint is more trees require more RAM
46. Annoy query structure
1. Use priority queue to search all trees until we’ve
found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
59. Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together
60. Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
61. Things for the future
• Use a optimization scheme for tree building
• Add more distance functions (eg. edit distance)
• Use a proper KV store as a backend (eg. LMDB) to
support incremental adds, out-of-core, arbitrary
keys: https://github.com/Houzz/annoy2