
Approximate nearest neighbor methods and vector models – NYC ML meetup


Nearest neighbors refers to something that is conceptually very simple: given a set of points in some space (possibly high-dimensional), find the k closest points to a query quickly.

This presentation covers a library called Annoy, built by me, that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We go through vector models, how to measure similarity, and why nearest neighbor queries are useful.


  1. 1. Approximate nearest neighbors & vector models
  2. 2. I’m Erik • @fulhack • Author of Annoy, Luigi • Currently CTO of Better • Previously 5 years at Spotify
  3. 3. What’s nearest neighbor(s) • Let’s say you have a bunch of points
  4. 4. Grab a bunch of points
  5. 5. 5 nearest neighbors
  6. 6. 20 nearest neighbors
  7. 7. 100 nearest neighbors
  8. 8. …But what’s the point? • vector models are everywhere • lots of applications (language processing, recommender systems, computer vision)
  9. 9. MNIST example • 28x28 = 784-dimensional dataset • Define distance in terms of pixels:
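The distance formula on this slide didn't survive the transcript. A minimal sketch, assuming plain Euclidean distance over the raw 784 pixel values (the usual baseline for MNIST):

```python
import numpy as np

def pixel_distance(a, b):
    # Flatten each 28x28 image to a 784-dim vector and take plain
    # Euclidean distance between raw pixel intensities.
    return np.linalg.norm(a.astype(float).ravel() - b.astype(float).ravel())

# Hypothetical usage, with img1 and img2 as 28x28 uint8 arrays:
# d = pixel_distance(img1, img2)
```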
  10. 10. MNIST neighbors
  11. 11. …Much better approach 1. Start with high-dimensional data 2. Run dimensionality reduction to 10-1000 dims 3. Do stuff in the low-dimensional space
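A minimal sketch of this recipe, using scikit-learn's PCA as a stand-in for step 2 (the next slides use a neural network bottleneck instead); the data here is a random placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for flattened MNIST images.
X = np.random.rand(1000, 784)

# Step 2: reduce 784 dims down to something small.
pca = PCA(n_components=64)
X_small = pca.fit_transform(X)   # shape (1000, 64)

# Step 3: nearest neighbor queries now run in the 64-dim space.
```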
  12. 12. Deep learning for food • Deep model trained on a GPU on 6M random pics downloaded from Yelp • [Architecture diagram: repeated 3x3 convolutions with 2x2 maxpools, 156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512, then fully connected layers with dropout (2048 → 2048), a 128-dim bottleneck layer, and 1244 outputs]
  13. 13. Distance in smaller space 1. Run image through the network 2. Use the 128-dimensional bottleneck layer as an item vector 3. Use cosine distance in the reduced space
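A sketch of these steps, assuming a hypothetical `embed()` helper standing in for a forward pass through the trained network:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cos(u, v): 0 for identical directions, up to 2 for opposite.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical helper: embed(img) would run an image through the
# trained network and return the 128-dim bottleneck activations.
# d = cosine_distance(embed(img_a), embed(img_b))
```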
  14. 14. Nearest food pics
  15. 15. Vector methods for text • TF-IDF (old) – no dimensionality reduction • Latent Semantic Analysis (1988) • Probabilistic Latent Semantic Analysis (2000) • Semantic Hashing (2007) • word2vec (2013), RNN, LSTM, …
  16. 16. Represent documents and/or words as f-dimensional vectors • [Scatter plot: apple, banana, and boat plotted against axes "Latent factor 1" and "Latent factor 2"]
  17. 17. Vector methods for collaborative filtering • Supervised methods: See everything from the Netflix Prize • Unsupervised: Use NLP methods
  18. 18. CF vectors – examples • IPMF item–item: $P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$ • Vectors: $p_{ui} = a_u^T b_i$, $\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i||b_j|}$, computed in $O(f)$ • IPMF item–item (MDS): $P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$, $\mathrm{sim}_{ij} = -|b_j - b_i|^2$ • Example similarities:

     | i | j | sim(i, j) |
     |---|---|-----------|
     | 2pac | 2pac | 1.0 |
     | 2pac | Notorious B.I.G. | 0.91 |
     | 2pac | Dr. Dre | 0.87 |
     | 2pac | Florence + the Machine | 0.26 |
     | Florence + the Machine | Lana Del Rey | 0.81 |
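A small numeric check of the formulas above, with made-up 2-dimensional item vectors (these won't reproduce the similarity table, which comes from real Spotify models):

```python
import numpy as np

# Made-up 2-dim item vectors, purely to exercise the formulas:
b = {
    '2pac': np.array([1.0, 0.2]),
    'Notorious B.I.G.': np.array([0.9, 0.3]),
    'Florence + the Machine': np.array([0.1, 1.0]),
}

def transition_probs(i):
    # P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i)
    scores = {j: np.exp(np.dot(bj, b[i])) for j, bj in b.items()}
    Z = sum(scores.values())
    return {j: s / Z for j, s in scores.items()}

def sim(i, j):
    # sim_ij = cos(b_i, b_j) = b_i . b_j / (|b_i| |b_j|), O(f) per pair
    return np.dot(b[i], b[j]) / (np.linalg.norm(b[i]) * np.linalg.norm(b[j]))

print(sim('2pac', 'Notorious B.I.G.'))        # high: similar directions
print(sim('2pac', 'Florence + the Machine'))  # much lower
```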
  19. 19. Geospatial indexing • Ping the world: https://github.com/erikbern/ping • k-NN regression using Annoy
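A sketch of what k-NN regression over ping probes could look like with Annoy. This is not the ping repo's actual code; the probe data and the `latlng_to_xyz` helper are invented for illustration:

```python
import math
from annoy import AnnoyIndex

def latlng_to_xyz(lat, lng):
    # Put coordinates on the unit sphere so Euclidean distance is a
    # sensible proxy for geographic closeness.
    lat, lng = math.radians(lat), math.radians(lng)
    return [math.cos(lat) * math.cos(lng),
            math.cos(lat) * math.sin(lng),
            math.sin(lat)]

# Hypothetical probe data: (lat, lng, measured latency in ms).
probes = [(59.33, 18.07, 30.0), (40.71, -74.01, 95.0), (35.68, 139.69, 210.0)]

index = AnnoyIndex(3, 'euclidean')
latency = []
for i, (lat, lng, ms) in enumerate(probes):
    index.add_item(i, latlng_to_xyz(lat, lng))
    latency.append(ms)
index.build(10)

def predict_latency(lat, lng, k=2):
    # k-NN regression: average latency over the k nearest probes.
    nns = index.get_nns_by_vector(latlng_to_xyz(lat, lng), k)
    return sum(latency[i] for i in nns) / len(nns)

print(predict_latency(52.52, 13.40))  # query point near the European probe
```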
  20. 20. Nearest neighbors the brute force way • we can always do an exhaustive search to find the nearest neighbors • imagine MySQL doing a linear scan for every query…
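A minimal brute-force scan in NumPy, just to make the cost concrete:

```python
import numpy as np

def brute_force_knn(X, q, k):
    # Exact but O(n * f) per query: measure the distance from q to
    # every one of the n vectors, then keep the k smallest.
    dists = np.linalg.norm(X - q, axis=1)
    return np.argsort(dists)[:k]

# The word2vec vocabulary is roughly 3M vectors x 300 dims, which is
# why the exhaustive scan on the next slide takes minutes.
```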
  21. 21. Using word2vec’s brute force search

     $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
     Qiantang_River 0.597229
     Yangtse 0.587990
     Yangtze_River 0.576738
     lake 0.567611
     rivers 0.567264
     creek 0.567135
     Mekong_river 0.550916
     Xiangjiang_River 0.550451
     Beas_river 0.549198
     Minjiang_River 0.548721
     real 2m34.346s
     user 1m36.235s
     sys 0m16.362s
  22. 22. Introducing Annoy • https://github.com/spotify/annoy • mmap-based ANN library • Written in C++, with Python and R bindings • 585 stars on GitHub
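Basic usage, closely following Annoy's README:

```python
import random
from annoy import AnnoyIndex

f = 40  # vector dimensionality
index = AnnoyIndex(f, 'angular')  # 'angular' is Annoy's cosine-style metric
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)          # 10 trees: more trees, better accuracy, more RAM
index.save('test.ann')   # the file can then be mmapped by any process

u = AnnoyIndex(f, 'angular')
u.load('test.ann')       # loads via mmap, so this is effectively instant
print(u.get_nns_by_item(0, 10))  # the 10 nearest neighbors of item 0
```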
  23. 23. Using Annoy’s search

     $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
     Yangtse 0.907756
     Yangtze_River 0.920067
     rivers 0.930308
     creek 0.930447
     Mekong_river 0.947718
     Huangpu_River 0.951850
     Ganges 0.959261
     Thu_Bon 0.960545
     Yangtze 0.966199
     Yangtze_river 0.978978
     real 0m0.470s
     user 0m0.285s
     sys 0m0.162s
  24. 24. Using Annoy’s search

     $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
     Qiantang_River 0.897519
     Yangtse 0.907756
     Yangtze_River 0.920067
     lake 0.929934
     rivers 0.930308
     creek 0.930447
     Mekong_river 0.947718
     Xiangjiang_River 0.948208
     Beas_river 0.949528
     Minjiang_River 0.950031
     real 0m2.013s
     user 0m1.386s
     sys 0m0.614s
  25. 25. (performance)
  26. 26. 1. Building an Annoy index
  27. 27. Start with the point set
  28. 28. Split it in two halves
  29. 29. Split again
  30. 30. Again…
  31. 31. …more iterations later
  32. 32. Side note: making trees small • Split until each leaf has at most K items (K ≈ 100) • Takes O(n/K) memory instead of O(n)
  33. 33. Binary tree
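A simplified sketch of the build recursion from the last few slides. Annoy's actual implementation is C++ and differs in details (e.g. how split planes are chosen); `points` and `ids` here are toy inputs:

```python
import random
import numpy as np

K = 100  # stop splitting once a node holds at most K items

def build_tree(points, ids):
    # points: array mapping item id -> vector; ids: list of item ids.
    if len(ids) <= K:
        return {'leaf': list(ids)}
    # Pick two random items and split on the hyperplane equidistant
    # from them (a simplified version of Annoy's splitting rule).
    a, b = random.sample(list(ids), 2)
    normal = points[a] - points[b]
    midpoint = (points[a] + points[b]) / 2.0
    left = [i for i in ids if np.dot(points[i] - midpoint, normal) <= 0]
    right = [i for i in ids if np.dot(points[i] - midpoint, normal) > 0]
    if not left or not right:   # degenerate split; just make a leaf
        return {'leaf': list(ids)}
    return {'normal': normal, 'midpoint': midpoint,
            'left': build_tree(points, left),
            'right': build_tree(points, right)}
```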
  34. 34. 2. Searching
  35. 35. Nearest neighbors
  36. 36. Searching the tree
  37. 37. Problemo • The point that’s the closest isn’t necessarily in the same leaf of the binary tree • Two points that are really close may end up on different sides of a split • Solution: go down both sides of a split if the query is close to the splitting plane
  38. 38. Trick 1: Priority queue • Traverse the tree using a priority queue • sort by min(margin) for the path from the root
  39. 39. Trick 2: many trees • Construct trees randomly many times • Use the same priority queue to search all of them at the same time
  40. 40. heap + forest = best • Since we use a priority queue, we explore the most promising splits first (largest margin from the splitting plane) • More trees always help! • The only constraint is that more trees require more RAM
  41. 41. Annoy query structure 1. Use priority queue to search all trees until we’ve found k items 2. Take union and remove duplicates (a lot) 3. Compute distance for remaining items 4. Return the nearest n items
  42. 42. Find candidates
  43. 43. Take union of all leaves
  44. 44. Compute distances
  45. 45. Return nearest neighbors
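Putting the query structure together: a simplified sketch of the search, reusing the toy tree nodes from the build sketch above. Annoy's real search differs in details (e.g. the exact semantics of search_k):

```python
import heapq
from itertools import count
import numpy as np

def get_candidates(forest, q, search_k):
    # Step 1: one shared priority queue over all trees, ordered by the
    # smallest split margin on the path so far (negated for Python's
    # min-heap; the counter just breaks ties between equal priorities).
    tie = count()
    heap = [(-np.inf, next(tie), root) for root in forest]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < search_k:
        neg_margin, _, node = heapq.heappop(heap)
        if 'leaf' in node:
            candidates.update(node['leaf'])   # step 2: union + dedup
        else:
            d = float(np.dot(q - node['midpoint'], node['normal']))
            near, far = (node['left'], node['right']) if d <= 0 else \
                        (node['right'], node['left'])
            # The far side of the split is penalized by |d|, the
            # distance from q to the splitting plane.
            heapq.heappush(heap, (max(neg_margin, -abs(d)), next(tie), near))
            heapq.heappush(heap, (max(neg_margin, abs(d)), next(tie), far))
    return candidates

def nns(forest, points, q, n, search_k):
    candidates = get_candidates(forest, q, search_k)
    # Steps 3-4: exact distances for the candidates only, return top n.
    return sorted(candidates, key=lambda i: np.linalg.norm(points[i] - q))[:n]
```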
  46. 46. “Curse of dimensionality”
  47. 47. Are we screwed? • Would be nice if the data has a much smaller “intrinsic dimension”!
  48. 48. Improving the algorithm • [Scatter plot: queries/s vs. 1-NN accuracy; up = faster, right = more accurate]
  49. 49. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
  50. 50. perf/accuracy tradeoffs • [Plot: queries/s vs. 1-NN accuracy, annotated with “search more nodes” and “more trees” as the two knobs]
  51. 51. Things that work • Smarter plane splitting • Priority queue heuristics • Search more nodes than number of results • Align nodes closer together
  52. 52. Things that don’t work • Use lower-precision arithmetic • Priority queue by other heuristics (number of trees) • Precompute vector norms
  53. 53. Things for the future • Use an optimization scheme for tree building • Add more distance functions (e.g. edit distance) • Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core storage, and arbitrary keys: https://github.com/Houzz/annoy2
  54. 54. Thanks! • https://github.com/spotify/annoy • https://github.com/erikbern/ann-benchmarks • https://github.com/erikbern/ann-presentation • erikbern.com • @fulhack
