
Erik Bernhardsson, CTO, Better Mortgage

Erik Bernhardsson is the CTO at Better, a small startup in NYC working with mortgages. Before Better, he spent five years at Spotify managing teams working with machine learning and data analytics, in particular music recommendations.

Abstract Summary:

Nearest Neighbor Methods And Vector Models: Vector models are used in many different fields: natural language processing, recommender systems, computer vision, and others. They are fast and convenient, and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increases, finding similar items becomes hard. Erik developed a library called “Annoy” that uses a forest of random trees to do fast approximate nearest neighbor queries in high-dimensional spaces. We will cover some specific applications of vector models and how Annoy works.



  1. Approximate nearest neighbors & vector models
  2. I’m Erik • @fulhack • Author of Annoy, Luigi • Currently CTO of Better • Previously 5 years at Spotify
  3. What’s nearest neighbor(s) • Let’s say you have a bunch of points
  4. Grab a bunch of points
  5. 5 nearest neighbors
  6. 20 nearest neighbors
  7. 100 nearest neighbors
  8. …But what’s the point? • vector models are everywhere • lots of applications (language processing, recommender systems, computer vision)
  9. MNIST example • 28x28 = 784-dimensional dataset • Define distance in terms of pixels:
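A rough sketch of what “distance in terms of pixels” means here (using NumPy and random stand-in data rather than the real MNIST images): the naive approach is just Euclidean distance between the flattened 784-dimensional pixel vectors.

```python
import numpy as np

# Random stand-in for MNIST: n images, each 28x28 = 784 pixels, flattened.
images = np.random.randint(0, 256, size=(60000, 784)).astype(np.float32)
query = images[0]

# "Distance in terms of pixels": plain Euclidean distance between pixel vectors.
dists = np.linalg.norm(images - query, axis=1)

# Indices of the 5 nearest neighbors (the query itself comes back at distance 0).
print(np.argsort(dists)[:5])
```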
  10. MNIST neighbors
  11. …Much better approach 1. Start with high dimensional data 2. Run dimensionality reduction to 10-1000 dims 3. Do stuff in a small dimensional space
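One way to do step 2, as a minimal sketch: the talk does not prescribe a specific reduction method, so scikit-learn’s PCA is used here purely as an illustration, on random stand-in data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for high-dimensional data (e.g. 784-dim pixel vectors).
X = np.random.rand(10000, 784).astype(np.float32)

# Step 2: reduce to a small number of dimensions (64 is within the 10-1000 range).
X_small = PCA(n_components=64).fit_transform(X)

# Step 3: "do stuff" (e.g. nearest-neighbor search) in the 64-dim space.
print(X_small.shape)  # (10000, 64)
```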
  12. Deep learning for food • Deep model trained on a GPU on 6M random pics downloaded from Yelp • [network diagram: 3x3 convolutions, 2x2 maxpool, fully connected layers with dropout, and a bottleneck layer; layer sizes 156x156x32, 154x154x32, 152x152x32, 76x76x64, 74x74x64, 72x72x64, 36x36x128, 34x34x128, 32x32x128, 16x16x256, 14x14x256, 12x12x256, 6x6x512, 4x4x512, 2x2x512, 2048, 2048, 128, 1244]
  13. Distance in smaller space 1. Run image through the network 2. Use the 128-dimensional bottleneck layer as an item vector 3. Use cosine distance in the reduced space
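A minimal sketch of step 3, assuming you already have the 128-dimensional bottleneck vectors for two images (the network and the feature-extraction step are omitted, and the vectors below are random stand-ins):

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity between the two item vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Random stand-ins for the 128-dim bottleneck vectors of two food pictures.
vec_a, vec_b = np.random.rand(128), np.random.rand(128)
print(cosine_distance(vec_a, vec_b))
```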
  14. Nearest food pics
  15. Vector methods for text • TF-IDF (old) – no dimensionality reduction • Latent Semantic Analysis (1988) • Probabilistic Latent Semantic Analysis (2000) • Semantic Hashing (2007) • word2vec (2013), RNN, LSTM, …
  16. Represent documents and/or words as f-dimensional vectors • [scatter plot: latent factor 1 vs. latent factor 2, with points for “banana”, “apple”, “boat”]
  17. Vector methods for collaborative filtering • Supervised methods: See everything from the Netflix Prize • Unsupervised: Use NLP methods
  18. CF vectors – examples • Item-item model: P(i → j) = exp(b_j^T b_i) / Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i) • Vectors: p_ui = a_u^T b_i; sim(i, j) = cos(b_i, b_j) = b_i^T b_j / (|b_i| |b_j|), computed in O(f) • Example similarities: 2pac / 2pac 1.0, 2pac / Notorious B.I.G. 0.91, 2pac / Dr. Dre 0.87, 2pac / Florence + the Machine 0.26, Florence + the Machine / Lana Del Rey 0.81 • MDS variant: P(i → j) = exp(−|b_j − b_i|²) / Z_i = exp(−|b_j − b_i|²) / Σ_k exp(−|b_k − b_i|²), sim(i, j) = −|b_j − b_i|²
  19. Geospatial indexing • Ping the world: https://github.com/erikbern/ping • k-NN regression using Annoy
  20. low-dimensional embedding • “Visualizing Large-scale and High-dimensional Data” • https://github.com/elbamos/largeVis (R implementation)
  21. Nearest neighbors the brute force way • we can always do an exhaustive search to find the nearest neighbors • imagine MySQL doing a linear scan for every query…
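The brute-force search is trivial to write, which is exactly why it makes a useful baseline. A sketch with NumPy and random stand-in vectors:

```python
import numpy as np

def brute_force_knn(query, vectors, k=10):
    # Exhaustive scan: distance from the query to every vector, then sort.
    # O(n * f) work per query -- fine for small n, painful at word2vec scale.
    dists = np.linalg.norm(vectors - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# Random stand-in data: 100,000 vectors of 300 dimensions (word2vec-sized).
vectors = np.random.rand(100_000, 300).astype(np.float32)
print(brute_force_knn(vectors[0], vectors, k=10))
```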
  22. Using word2vec’s brute force search • $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin • Results: Qiantang_River 0.597229, Yangtse 0.587990, Yangtze_River 0.576738, lake 0.567611, rivers 0.567264, creek 0.567135, Mekong_river 0.550916, Xiangjiang_River 0.550451, Beas_river 0.549198, Minjiang_River 0.548721 • Timing: real 2m34.346s, user 1m36.235s, sys 0m16.362s
  23. Introducing Annoy • https://github.com/spotify/annoy • mmap-based ANN library • Written in C++, with Python/R/Go/Lua bindings • 585 → 1227 stars on GitHub
  24. mmap = best • load huge data files immediately • share data between processes
  25. Using Annoy’s search • $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000 • Results: Yangtse 0.907756, Yangtze_River 0.920067, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Huangpu_River 0.951850, Ganges 0.959261, Thu_Bon 0.960545, Yangtze 0.966199, Yangtze_river 0.978978 • Timing: real 0m0.470s, user 0m0.285s, sys 0m0.162s
  26. Using Annoy’s search • $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000 • Results: Qiantang_River 0.897519, Yangtse 0.907756, Yangtze_River 0.920067, lake 0.929934, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Xiangjiang_River 0.948208, Beas_river 0.949528, Minjiang_River 0.950031 • Timing: real 0m2.013s, user 0m1.386s, sys 0m0.614s
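For reference, the basic Annoy API looks like this. This is a minimal sketch with random vectors rather than the GoogleNews word vectors; the `nearest_neighbors.py` script in the slides is presumably a wrapper along these lines, with the trailing 100000/1000000 arguments mapping to a search-effort knob like `search_k`.

```python
import random
from annoy import AnnoyIndex

f = 300                           # word2vec vectors are 300-dimensional
index = AnnoyIndex(f, 'angular')  # 'angular' is a cosine-like distance

# Random stand-in vectors; in the slides these would be the word2vec vectors.
for i in range(100000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)            # 10 trees: more trees -> better recall, more RAM/disk
index.save('vectors.ann')  # the saved file is mmap'd, so loading is ~instant

u = AnnoyIndex(f, 'angular')
u.load('vectors.ann')
# 10 nearest neighbors of item 0; search_k trades accuracy for query time.
print(u.get_nns_by_item(0, 10, search_k=100000, include_distances=True))
```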
  27. (performance)
  28. 1. Building an Annoy index
  29. Start with the point set
  30. Split it in two halves
  31. Split again
  32. Again…
  33. …more iterations later
  34. Side note: making trees small • Split until K items in each leaf (K~100) • Takes (n/K) memory instead of n
  35. Binary tree
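Very roughly, building one such tree looks like the sketch below. This is not Annoy’s actual C++ code (it ignores degenerate splits and other details, and runs on random stand-in data); it only illustrates the idea of recursively splitting with random hyperplanes until each leaf holds at most K items.

```python
import numpy as np

K = 100  # stop splitting once a node holds at most K items

def build_tree(indices, vectors):
    # Leaf: few enough points that a linear scan inside the leaf is cheap.
    if len(indices) <= K:
        return {'leaf': indices}
    # Pick two random points and split by the hyperplane equidistant between
    # them (the spirit of Annoy's split; the real code handles more cases).
    a, b = vectors[np.random.choice(indices, 2, replace=False)]
    normal = a - b
    offset = -np.dot(normal, (a + b) / 2.0)
    side = vectors[indices] @ normal + offset > 0
    return {'normal': normal, 'offset': offset,
            'left': build_tree(indices[~side], vectors),
            'right': build_tree(indices[side], vectors)}

# Random stand-in data: 10,000 points in 128 dimensions.
vectors = np.random.rand(10000, 128).astype(np.float32)
tree = build_tree(np.arange(len(vectors)), vectors)
```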
  36. 2. Searching
  37. Nearest neighbors
  38. Searching the tree
  39. Problemo • The point that’s the closest isn’t necessarily in the same leaf of the binary tree • Two points that are really close may end up on different sides of a split • Solution: go to both sides of a split if it’s close
  40. Trick 1: Priority queue • Traverse the tree using a priority queue • sort by min(margin) for the path from the root
  41. Trick 2: many trees • Construct trees randomly many times • Use the same priority queue to search all of them at the same time
  42. heap + forest = best • Since we use a priority queue, we will dive down the best splits with the biggest distance • More trees always helps! • Only constraint is more trees require more RAM
  43. Annoy query structure 1. Use priority queue to search all trees until we’ve found k items 2. Take union and remove duplicates (a lot) 3. Compute distance for remaining items 4. Return the nearest n items
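A toy version of steps 1-4, continuing the tree node format from the building sketch above. This is an illustration of the priority-queue-over-a-forest idea, not Annoy’s real implementation.

```python
import heapq
import itertools
import numpy as np

def query(forest, vectors, q, n, search_k):
    # One shared priority queue across all trees. A node's priority is the
    # smallest margin seen on the path from its root (bigger is better), so we
    # store the negated value for Python's min-heap, plus a tie-breaking counter.
    counter = itertools.count()
    heap = [(-np.inf, next(counter), root) for root in forest]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < search_k:
        neg_margin, _, node = heapq.heappop(heap)
        if 'leaf' in node:
            candidates.update(int(i) for i in node['leaf'])  # steps 1-2: union of leaves
            continue
        margin = float(np.dot(node['normal'], q) + node['offset'])
        # Visit both children: the side the query falls on keeps its priority,
        # the other side is ranked by how close the query is to the split plane.
        heapq.heappush(heap, (max(neg_margin, -margin), next(counter), node['right']))
        heapq.heappush(heap, (max(neg_margin, margin), next(counter), node['left']))
    # Step 3: exact distances for the deduplicated candidates; step 4: top n.
    candidates = list(candidates)
    dists = np.linalg.norm(vectors[candidates] - q, axis=1)
    order = np.argsort(dists)[:n]
    return [candidates[i] for i in order], dists[order]

# Reusing build_tree and vectors from the building sketch above:
forest = [build_tree(np.arange(len(vectors)), vectors) for _ in range(10)]
print(query(forest, vectors, vectors[0], n=10, search_k=1000))
```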
  44. Find candidates
  45. Take union of all leaves
  46. Compute distances
  47. Return nearest neighbors
  48. “Curse of dimensionality”
  49. Are we screwed? • Would be nice if the data has a much smaller “intrinsic dimension”!
  50. Improving the algorithm • [plot: queries per second vs. 1-NN accuracy; up = faster, right = more accurate]
  51. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
  52. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
  53. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
  54. Current ANN trends • “small world” graph algorithms: SW-graph, k-graph, HNSW • locality sensitive hashing: FALCONN
  55. Thanks! • https://github.com/spotify/annoy • https://github.com/erikbern/ann-benchmarks • https://github.com/erikbern/ann-presentation • erikbern.com • @fulhack
