This presentation covers Annoy, a library I built that helps you run (approximate) nearest neighbor queries in high-dimensional spaces. It goes through vector models, how to measure similarity, and why nearest neighbor queries are useful.

- 1. Approximate nearest neighbors & vector models
- 2. I’m Erik • @fulhack • Author of Annoy, Luigi • Currently CTO of Better • Previously 5 years at Spotify
- 3. What’s nearest neighbor(s) • Let’s say you have a bunch of points
- 4. Grab a bunch of points
- 5. 5 nearest neighbors
- 6. 20 nearest neighbors
- 7. 100 nearest neighbors
- 8. …But what’s the point? • vector models are everywhere • lots of applications (language processing, recommender systems, computer vision)
- 9. MNIST example • 28x28 = 784-dimensional dataset • Define distance in terms of pixels:
- 10. MNIST neighbors
- 11. …Much better approach 1. Start with high dimensional data 2. Run dimensionality reduction to 10-1000 dims 3. Do stuff in a small dimensional space
- 12. Deep learning for food • Deep model trained on a GPU on 6M random pics downloaded from Yelp • Layer sizes (from the architecture diagram): 156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512 → 2048 → 2048 → 128 → 1244 • Diagram legend: 3x3 convolutions, 2x2 maxpool, fully connected with dropout, bottleneck layer
- 13. Distance in smaller space 1. Run image through the network 2. Use the 128-dimensional bottleneck layer as an item vector 3. Use cosine distance in the reduced space
- 14. Nearest food pics
- 15. Vector methods for text • TF-IDF (old) – no dimensionality reduction • Latent Semantic Analysis (1988) • Probabilistic Latent Semantic Analysis (2000) • Semantic Hashing (2007) • word2vec (2013), RNN, LSTM, …
- 16. Represent documents and/or words as f-dimensional vectors • [Figure: "banana", "apple", and "boat" plotted against latent factor 1 and latent factor 2]
- 17. Vector methods for collaborative filtering • Supervised methods: See everything from the Netflix Prize • Unsupervised: Use NLP methods
- 18. CF vectors – examples • IPMF item-item: $P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$ • Vectors: $p_{ui} = a_u^T b_i$, $\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i||b_j|}$, computed in $O(f)$ • Example similarities $(i, j, \mathrm{sim}_{i,j})$: (2pac, 2pac, 1.0), (2pac, Notorious B.I.G., 0.91), (2pac, Dr. Dre, 0.87), (2pac, Florence + the Machine, 0.26), (Florence + the Machine, Lana Del Rey, 0.81) • IPMF item-item MDS: $P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$, $\mathrm{sim}_{ij} = -|b_j - b_i|^2$ • (u, i, count) • $\partial L$
- 19. Geospatial indexing • Ping the world: https://github.com/erikbern/ping • k-NN regression using Annoy
- 20. Nearest neighbors the brute force way • we can always do an exhaustive search to find the nearest neighbors • imagine MySQL doing a linear scan for every query…
- 21. Using word2vec’s brute force search • $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin • Results: Qiantang_River 0.597229 • Yangtse 0.587990 • Yangtze_River 0.576738 • lake 0.567611 • rivers 0.567264 • creek 0.567135 • Mekong_river 0.550916 • Xiangjiang_River 0.550451 • Beas_river 0.549198 • Minjiang_River 0.548721 • Timing: real 2m34.346s, user 1m36.235s, sys 0m16.362s
- 22. Introducing Annoy • https://github.com/spotify/annoy • mmap-based ANN library • Written in C++, with Python and R bindings • 585 stars on GitHub
- 23. Using Annoy’s search • $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000 • Results: Yangtse 0.907756 • Yangtze_River 0.920067 • rivers 0.930308 • creek 0.930447 • Mekong_river 0.947718 • Huangpu_River 0.951850 • Ganges 0.959261 • Thu_Bon 0.960545 • Yangtze 0.966199 • Yangtze_river 0.978978 • Timing: real 0m0.470s, user 0m0.285s, sys 0m0.162s
- 24. Using Annoy’s search • $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000 • Results: Qiantang_River 0.897519 • Yangtse 0.907756 • Yangtze_River 0.920067 • lake 0.929934 • rivers 0.930308 • creek 0.930447 • Mekong_river 0.947718 • Xiangjiang_River 0.948208 • Beas_river 0.949528 • Minjiang_River 0.950031 • Timing: real 0m2.013s, user 0m1.386s, sys 0m0.614s
- 25. (performance)
- 26. 1. Building an Annoy index
- 27. Start with the point set
- 28. Split it in two halves
- 29. Split again
- 30. Again…
- 31. …more iterations later
- 32. Side note: making trees small • Split until K items in each leaf (K~100) • Takes O(n/K) memory instead of O(n)
- 33. Binary tree
- 34. 2. Searching
- 35. Nearest neighbors
- 36. Searching the tree
- 37. Problemo • The point that’s the closest isn’t necessarily in the same leaf of the binary tree • Two points that are really close may end up on different sides of a split • Solution: go to both sides of a split if it’s close
- 38. Trick 1: Priority queue • Traverse the tree using a priority queue • sort by min(margin) for the path from the root
- 39. Trick 2: many trees • Construct trees randomly many times • Use the same priority queue to search all of them at the same time
- 40. heap + forest = best • Since we use a priority queue, we will dive down the best splits with the biggest distance • More trees always helps! • Only constraint is more trees require more RAM
- 41. Annoy query structure 1. Use priority queue to search all trees until we’ve found k items 2. Take union and remove duplicates (a lot) 3. Compute distance for remaining items 4. Return the nearest n items
- 42. Find candidates
- 43. Take union of all leaves
- 44. Compute distances
- 45. Return nearest neighbors
- 46. “Curse of dimensionality”
- 47. Are we screwed? • Would be nice if the data has a much smaller “intrinsic dimension”!
- 48. Improving the algorithm • [Plot: queries/s against 1-NN accuracy, with arrows labeled "more accurate" and "faster"]
- 49. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
- 50. perf/accuracy tradeoffs • [Plot: queries/s against 1-NN accuracy, with arrows labeled "search more nodes" and "more trees"]
- 51. Things that work • Smarter plane splitting • Priority queue heuristics • Search more nodes than number of results • Align nodes closer together
- 52. Things that don’t work • Use lower-precision arithmetic • Priority queue by other heuristics (number of trees) • Precompute vector norms
- 53. Things for the future • Use an optimization scheme for tree building • Add more distance functions (e.g. edit distance) • Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core, arbitrary keys: https://github.com/Houzz/annoy2
- 54. Thanks! • https://github.com/spotify/annoy • https://github.com/erikbern/ann-benchmarks • https://github.com/erikbern/ann-presentation • erikbern.com • @fulhack

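Slides 13 and 20 describe the baseline: compare the query against every item and keep the closest ones. Below is a minimal standard-library sketch of that exhaustive search using cosine similarity. The function names and the toy vectors are invented for illustration and are not part of word2vec or Annoy. The last helper converts a cosine similarity into sqrt(2(1 - cos)), which is how Annoy documents its angular distance. That conversion appears to account for the different score scales on slides 21 and 23: Yangtse's 0.587990 maps to roughly 0.907756.

```python
import heapq
import math


def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| |b|), as on slide 18."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(vectors, query, k):
    """Exhaustive search: score every item, keep the k most similar."""
    scored = ((cosine_similarity(v, query), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


def angular_distance(cos_sim):
    """sqrt(2 * (1 - cos)): Euclidean distance between unit-length vectors."""
    return math.sqrt(2.0 * (1.0 - cos_sim))


# Toy 2-d item vectors (hypothetical data, not from the talk).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = brute_force_knn(vectors, [1.0, 0.05], k=2)
```

Every query touches all n vectors of dimension f, so the cost is O(n f) per query. That linear scan is what the tree index in the second half of the talk is meant to avoid.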

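Slides 26 to 45 outline the index: split the point set recursively into a binary tree, build several such trees at random, search them all with one priority queue ordered by the smallest margin along each path, then take the union of the reached leaves and rank it by exact distance. The sketch below follows those steps in plain Python under stated assumptions. The slides do not say how a split plane is chosen, so the rule used here (the plane equidistant from two randomly sampled points) is an assumption, as are all function names, the Euclidean metric, and the `search_k` parameter. It is not Annoy's actual code.

```python
import heapq
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_tree(points, ids, leaf_size, rng):
    """Split ids recursively until each leaf holds at most leaf_size items."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    p, q = (points[i] for i in rng.sample(ids, 2))
    normal = [a - b for a, b in zip(p, q)]
    norm = math.sqrt(dot(normal, normal))
    if norm == 0.0:  # sampled points coincide: stop splitting here
        return ("leaf", ids)
    normal = [x / norm for x in normal]
    # Assumed split rule: the plane passes through the midpoint of p and q.
    offset = -dot(normal, [(a + b) / 2 for a, b in zip(p, q)])
    right = [i for i in ids if dot(normal, points[i]) + offset > 0]
    left = [i for i in ids if dot(normal, points[i]) + offset <= 0]
    if not left or not right:  # degenerate split: keep as one leaf
        return ("leaf", ids)
    return ("split", normal, offset,
            build_tree(points, left, leaf_size, rng),
            build_tree(points, right, leaf_size, rng))


def build_forest(points, n_trees, leaf_size, seed=0):
    rng = random.Random(seed)
    ids = list(range(len(points)))
    return [build_tree(points, ids, leaf_size, rng) for _ in range(n_trees)]


def search_forest(trees, points, query, k, search_k):
    """Search all trees with one priority queue, then re-rank exactly."""
    # Heap entries are (-bound, tiebreak, node); the largest bound pops first.
    heap = [(-float("inf"), n, tree) for n, tree in enumerate(trees)]
    heapq.heapify(heap)
    counter = len(trees)
    candidates = set()  # union of leaves, duplicates removed
    while heap and len(candidates) < search_k:
        neg_bound, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            candidates.update(node[1])
            continue
        _, normal, offset, left, right = node
        margin = dot(normal, query) + offset
        # Each child is ranked by min(margin) along its path from the root.
        for child, m in ((right, margin), (left, -margin)):
            heapq.heappush(heap, (-min(-neg_bound, m), counter, child))
            counter += 1
    ranked = sorted(candidates, key=lambda i: sq_dist(points[i], query))
    return ranked[:k]


# Usage on synthetic data (hypothetical sizes and seeds).
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
trees = build_forest(points, n_trees=5, leaf_size=10, seed=1)
approx = search_forest(trees, points, points[17], k=3, search_k=30)
```

Raising `search_k` or `n_trees` corresponds to the "search more nodes" and "more trees" directions on slide 50: more candidates get inspected, so accuracy goes up and queries get slower. With `search_k` equal to the number of points, the sketch degenerates into an exact exhaustive search.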