Abstract:

Nearest Neighbor Methods and Vector Models: Vector models are used in many different fields: natural language processing, recommender systems, computer vision, and more. They are fast and convenient, and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increases, finding similar items becomes hard. Erik developed a library called "Annoy" that uses a forest of random trees to do fast approximate nearest neighbor queries in high-dimensional spaces. We will cover some specific applications of vector models and how Annoy works.


- 1. Approximate nearest neighbors & vector models
- 2. I’m Erik • @fulhack • Author of Annoy, Luigi • Currently CTO of Better • Previously 5 years at Spotify
- 3. What’s nearest neighbor(s) • Let’s say you have a bunch of points
- 4. Grab a bunch of points
- 5. 5 nearest neighbors
- 6. 20 nearest neighbors
- 7. 100 nearest neighbors
- 8. …But what’s the point? • vector models are everywhere • lots of applications (language processing, recommender systems, computer vision)
- 9. MNIST example • 28x28 = 784-dimensional dataset • Define distance in terms of pixels:
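A minimal sketch of the pixel-space distance above, using random vectors as stand-ins for real flattened MNIST images:

```python
import numpy as np

# Two hypothetical "images": random stand-ins for flattened 28x28 MNIST digits.
rng = np.random.default_rng(0)
a = rng.random(784)
b = rng.random(784)

def pixel_distance(a, b):
    # Euclidean distance computed directly in 784-dimensional pixel space.
    return np.sqrt(np.sum((a - b) ** 2))

print(pixel_distance(a, b))   # some positive distance
print(pixel_distance(a, a))   # 0.0 for identical images
```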
- 10. MNIST neighbors
- 11. …Much better approach 1. Start with high-dimensional data 2. Run dimensionality reduction to 10-1000 dims 3. Do stuff in the low-dimensional space
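One way to sketch step 2, using a plain truncated SVD as a stand-in for whatever reduction you prefer (PCA, an autoencoder, word2vec, …); the data here is random:

```python
import numpy as np

# Hypothetical high-dimensional data: 500 points in 784 dimensions.
rng = np.random.default_rng(1)
X = rng.random((500, 784))

# Truncated SVD: project onto the top 10 principal directions.
Xc = X - X.mean(axis=0)                       # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_small = Xc @ Vt[:10].T                      # 500 x 10 embedding

print(X_small.shape)  # nearest neighbor search happens in this small space
```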
- 12. Deep learning for food • Deep model trained on a GPU on 6M random pics downloaded from Yelp [architecture diagram: stacked 3x3 convolutions from 156x156x32 down to 2x2x512 with 2x2 maxpools, fully connected layers with dropout (2048, 2048), a 128-dimensional bottleneck layer, 1244-way output]
- 13. Distance in smaller space 1. Run image through the network 2. Use the 128-dimensional bottleneck layer as an item vector 3. Use cosine distance in the reduced space
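Step 3 above (cosine distance in the reduced space) can be sketched like this; the vector names are made up and the vectors are random stand-ins for real bottleneck activations:

```python
import numpy as np

# Hypothetical 128-dimensional bottleneck vectors for two food images.
rng = np.random.default_rng(2)
burger = rng.random(128)
pizza = rng.random(128)

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); 0 means the same direction.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(burger, pizza))
print(cosine_distance(burger, burger))  # ~0.0
```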
- 14. Nearest food pics
- 15. Vector methods for text • TF-IDF (old) – no dimensionality reduction • Latent Semantic Analysis (1988) • Probabilistic Latent Semantic Analysis (2000) • Semantic Hashing (2007) • word2vec (2013), RNN, LSTM, …
- 16. Represent documents and/or words as f-dimensional vectors [scatter plot: "banana", "apple", "boat" plotted against latent factor 1 and latent factor 2]
- 17. Vector methods for collaborative filtering • Supervised methods: See everything from the Netflix Prize • Unsupervised: Use NLP methods
- 18. CF vectors – examples • IPMF item-item: $P(i \to j) = \exp(b_j^T b_i)/Z_i = \exp(b_j^T b_i) / \sum_k \exp(b_k^T b_i)$ • Vectors: $p_{ui} = a_u^T b_i$, $\mathrm{sim}_{ij} = \cos(b_i, b_j) = b_i^T b_j / (|b_i|\,|b_j|)$, computed in $O(f)$ • IPMF item-item (MDS): $P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \exp(-|b_j - b_i|^2) / \sum_k \exp(-|b_k - b_i|^2)$, $\mathrm{sim}_{ij} = -|b_j - b_i|^2$ • Example similarities: 2pac–2pac 1.0, 2pac–Notorious B.I.G. 0.91, 2pac–Dr. Dre 0.87, 2pac–Florence + the Machine 0.26, Florence + the Machine–Lana Del Rey 0.81
- 19. Geospatial indexing • Ping the world: https://github.com/erikbern/ping • k-NN regression using Annoy
- 20. low-dimensional embedding • “Visualizing Large-scale and High-dimensional Data” • https://github.com/elbamos/largeVis (R implementation)
- 21. Nearest neighbors the brute force way • we can always do an exhaustive search to find the nearest neighbors • imagine MySQL doing a linear scan for every query…
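The brute-force approach is just an exhaustive scan; a minimal sketch on random data (sizes here are arbitrary):

```python
import numpy as np

# Hypothetical dataset: 10,000 items with 128-dimensional vectors.
rng = np.random.default_rng(3)
vectors = rng.random((10_000, 128))
query = rng.random(128)

def brute_force_knn(vectors, query, k):
    # Exhaustive scan: O(n * f) distance computations per query,
    # then a sort -- exact, but hopeless at scale.
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

print(brute_force_knn(vectors, query, 5))  # indices of the 5 nearest items
```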
- 22. Using word2vec’s brute force search $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin → Qiantang_River 0.597229, Yangtse 0.587990, Yangtze_River 0.576738, lake 0.567611, rivers 0.567264, creek 0.567135, Mekong_river 0.550916, Xiangjiang_River 0.550451, Beas_river 0.549198, Minjiang_River 0.548721 (real 2m34.346s, user 1m36.235s, sys 0m16.362s)
- 23. Introducing Annoy • https://github.com/spotify/annoy • mmap-based ANN library • Written in C++, with Python/R/Go/Lua bindings • 1,227 stars on GitHub
- 24. mmap = best • load huge data files immediately • share data between processes
- 25. Using Annoy’s search $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000 → Yangtse 0.907756, Yangtze_River 0.920067, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Huangpu_River 0.951850, Ganges 0.959261, Thu_Bon 0.960545, Yangtze 0.966199, Yangtze_river 0.978978 (real 0m0.470s, user 0m0.285s, sys 0m0.162s)
- 26. Using Annoy’s search $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000 → Qiantang_River 0.897519, Yangtse 0.907756, Yangtze_River 0.920067, lake 0.929934, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Xiangjiang_River 0.948208, Beas_river 0.949528, Minjiang_River 0.950031 (real 0m2.013s, user 0m1.386s, sys 0m0.614s)
- 27. (performance)
- 28. 1. Building an Annoy index
- 29. Start with the point set
- 30. Split it in two halves
- 31. Split again
- 32. Again…
- 33. …more iterations later
- 34. Side note: making trees small • Split until at most K items in each leaf (K ≈ 100) • Takes O(n/K) memory instead of O(n)
- 35. Binary tree
- 36. 2. Searching
- 37. Nearest neighbors
- 38. Searching the tree
- 39. Problemo • The point that’s the closest isn’t necessarily in the same leaf of the binary tree • Two points that are really close may end up on different sides of a split • Solution: go to both sides of a split if the query is close to it
- 40. Trick 1: Priority queue • Traverse the tree using a priority queue • sort by min(margin) for the path from the root
- 41. Trick 2: many trees • Construct trees randomly many times • Use the same priority queue to search all of them at the same time
- 42. heap + forest = best • Since we use a priority queue, we explore the most promising splits (largest margins) first • More trees always helps! • The only constraint is that more trees require more RAM
- 43. Annoy query structure 1. Use priority queue to search all trees until we’ve found k items 2. Take union and remove duplicates (a lot) 3. Compute distance for remaining items 4. Return the nearest n items
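The four steps above can be sketched roughly as follows. This is a toy illustration of the priority-queue-over-a-forest idea, not Annoy's real internals; the `Leaf`/`Split` node layout and all names are made up:

```python
import heapq
import itertools
import numpy as np

class Leaf:
    def __init__(self, items):
        self.items = items

class Split:
    def __init__(self, normal, offset, left, right):
        self.normal, self.offset = normal, offset
        self.left, self.right = left, right

_tiebreak = itertools.count()  # keeps heapq from comparing node objects

def search(roots, q, search_k):
    # One shared max-priority queue over all trees, keyed by the minimum
    # margin along the path from the root (negated for heapq's min-heap).
    heap = [(-np.inf, next(_tiebreak), root) for root in roots]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(heap)
        priority = -neg_priority
        if isinstance(node, Leaf):
            candidates.update(node.items)  # step 1: collect; step 2: the set dedupes
        else:
            margin = float(np.dot(node.normal, q) - node.offset)
            # Push both children; the far side is penalized by the margin,
            # so we still cross splits the query point is close to.
            heapq.heappush(heap, (-min(priority, +margin), next(_tiebreak), node.right))
            heapq.heappush(heap, (-min(priority, -margin), next(_tiebreak), node.left))
    return candidates  # steps 3-4: compute true distances, return the n best

# Tiny demo: one tree splitting 1-D points at x = 0.5.
tree = Split(np.array([1.0]), 0.5, Leaf([0, 1]), Leaf([2, 3]))
print(sorted(search([tree], np.array([0.9]), search_k=2)))  # [2, 3]
print(sorted(search([tree], np.array([0.9]), search_k=4)))  # [0, 1, 2, 3]
```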
- 44. Find candidates
- 45. Take union of all leaves
- 46. Compute distances
- 47. Return nearest neighbors
- 48. “Curse of dimensionality”
- 49. Are we screwed? • Would be nice if the data has a much smaller “intrinsic dimension”!
- 50. Improving the algorithm [chart: queries/s vs 1-NN accuracy; up is faster, right is more accurate]
- 51. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
- 52. ann-benchmarks [benchmark chart]
- 53. ann-benchmarks [benchmark chart]
- 54. Current ANN trends • “small world” graph algorithms: SW-graph, k-graph, HNSW • locality sensitive hashing: FALCONN
- 55. Thanks! • https://github.com/spotify/annoy • https://github.com/erikbern/ann-benchmarks • https://github.com/erikbern/ann-presentation • erikbern.com • @fulhack
