An introduction to similarity search and k-nn graphs

Similarity search is an essential component of machine learning algorithms. However, performing efficient similarity search can be extremely challenging, especially if the dataset is distributed across multiple computers, and even more so if the similarity measure is not a metric. With the rise of Big Data processing, such challenging datasets are increasingly common. In this presentation we show how k-nearest neighbors (k-nn) graphs can be used to perform similarity search, clustering and anomaly detection.

  1. Distributed k-nearest neighbors graph algorithms. Thibault Debatty, Ir PhD, 2019-12-03
  2. k-nn graph: each node has an edge to its k most similar nodes
  3. Context. Common tasks in machine learning, data mining, artificial intelligence and Big Data: ● Similarity search ● Clustering ● Anomaly detection
  4. Context: similarity search
  5. Context: similarity search. “High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v8g6” Is it similar to a known SPAM?
  6. Context: clustering. Kobe Bryant traded to Clippers No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9 Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj Nurses make Great Incomes Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3 Play here for summer fun Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69 Obtain details on your cred1t online. Get started today High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071 Japanese food discount Is your computer safe? Luxury at a Discount! Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5 Mutant fish sold at Connecticut market High quality JBL speakers Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl => Identify SPAM campaigns
  7. Context: clustering. To analyze 300 rogue websites: ● Cluster ● Analyze 1 representative of each group
  8. Context: anomaly detection. Find an infected computer on a network
  9. Context: ● Similarity search ● Clustering and ● Anomaly detection … are crucial for data processing!
  10. Challenges. How hard can that be?
  11. Challenges. Computer memory is similar to a book: ● Accessible by address (page) ● You have to read a page before you know its content (e.g. the coordinates of a point)
  12. Challenges. Naive similarity search requires reading all pages
  13. Challenges. How many pages? Bible TOB: ● 2000 pages ● Extra thin paper ● 12 cm thick ● 44 hours of reading
  14. Challenges. Samsung Galaxy S9 (4 GB): a stack of 63 m assuming 4 KB/page (the Atomium is 102 m tall), 2.6 years of reading...
  15. Challenges. Our server: ● 1500 GB ● 200,000 books ● A stack of 24 km (Brussels to Louvain-la-Neuve is 26 km) ● 1000 years of reading
  16. Challenges. Even with modern hardware, naive algorithms are not an option
  17. Indexes. Divide space into “zones”. Example: ● North: pages 1, 2, 3 and 4 ● South: pages 5, 6 and 7
  18. Indexes: similarity search with an index. The “query” is near zone “SOUTH” => read pages 5, 6 and 7
  19. Indexes: limitations. Similarity search with an index requires reading multiple zones: 1d: 2 zones, 2d: 4 zones, 3d: 8 zones, 8d: 256 zones. This is the “curse of dimensionality”
  20. Indexes: limitations. Great for low-dimensional Euclidean datasets (time). But what about: ● Higher dimensions? TV commercials: 4125 dimensions ● Text?
  21. k-nn graph. Can we use a k-nn graph to analyze large datasets?
  22. k-nn graph. Existing algorithms: ● Clustering ● Similarity search (but slow)
  23. Outline: ● Build from large text datasets ● Fast similarity search ● Add and remove points ● Applications: – Text clustering – Detection of compromised computers ● … using distributed processing!
  24. Build from large text datasets
  25. String similarity. But first… how to measure similarity between strings? Lots of literature: ● Levenshtein ● Damerau ● Jaro-Winkler ● N-Gram ● Q-Gram ● Cosine ● Jaccard index ● … But no clean implementation!
  26. String similarity
  27. String similarity
  28. String similarity
  29. String similarity
  30. Building from text datasets. ● NN-Descent: builds an approximate graph, computes O(n^1.14) similarities ● BUT: iterative!
  31. Building from text datasets. NNCTPH: ● Hash using a modified hashing function (CTPH / ssdeep / spamsum) ● Build subgraphs in parallel ● Merge subgraphs. Single iteration!
  32. Building from text datasets
  33. Building from text datasets. ● Experimental evaluation: – Apache Hadoop MapReduce – SPAM dataset – Jaro-Winkler string similarity (not a metric)
  34. Building from text datasets
  35. Fast similarity search. Add and remove points
  36. Online building. ● Given a distributed graph: – Add nodes – Remove nodes – Search nearest neighbors of query node ● Requires k-medoids partitioning of the graph
  37. Partitioning. ● k-medoids clustering ● CLARANS is slow to converge ● Two faster methods: – Inspired by Simulated Annealing – Heuristic ● Impact of partitioning when we perform distributed search
  38. Applications
  39. Text clustering. ● Text dataset with Jaro-Winkler similarity (not a metric) ● Steps: – Build (approximate) k-nn graph – Prune – Compute connected components
  40. APT Detection. ● Advanced => no signatures ● Persistent => limited activity ● Threats ● Need a C2 channel
  41. APT Detection
  42. APT Detection. Here: APT relying on HTTP => proxy logs
  43. APT Detection. How hard can that be?
  44. APT Detection
  45. APT Detection. Displaying a page requires multiple HTTP requests => link each request to its parent using the logs from the proxy
  46. APT Detection
  47. APT Detection
  48. APT Detection. Weight is higher if: ● Requests are close in time ● Requests belong to the same domain ● Same sequence repeats
  49. APT Detection. After pruning the weighted graph, the APT remains isolated!
  50. APT Detection. Weight is higher if: ● Requests are close in time ● Requests belong to the same domain ● Same sequence repeats
  51. APT Detection. ● Batch: build graphs ● Interactive (web interface): – Merge – Prune – Cluster – Filter ● Approximate k-nn graph (time and memory)
  52. APT Detection
  53. APT Detection. ● Experimental evaluation: – Proxy logs of a real network – Simulated APT traffic – Rank suspicious domains ● Results: – High detection / false alarm ratio – Without prior knowledge about the APT
  54. APT Detection. ● False positives: – Content Delivery Networks (CDN) – Advertising domains – JavaScript library delivery – Websites with very few visits => same behavior as APT
  55. Conclusion. The k-nn graph is an interesting tool to analyze large datasets, but ● Only if approximation is acceptable ● Other possibilities exist
  56. Perspectives... ● Broaden to other graph-like structures: – (Hierarchical) Small World Network graphs – Asymmetrical graphs ● Broaden to other applications (clustering, nn search) ● Predict the magnitude of approximation
  57. Questions... Cyber Defence Lab, www.cylab.be
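
The "memory as a stack of books" figures on slides 13 to 15 can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reconstruction: the slides do not say which units were used, so it assumes binary units (1 GB = 2**30 bytes, 4 KB = 4096 bytes) and non-stop reading, 24 hours a day. The function name `memory_as_books` is made up for this illustration.

```python
# Back-of-the-envelope check of the "memory as a stack of books" figures.
# Assumptions (not stated on the slides): binary units (1 GB = 2**30 bytes,
# 4 KB = 4096 bytes) and non-stop reading, 24 hours a day, 365 days a year.

PAGE_BYTES = 4096        # one memory page (4 KB/page, slide 14)
PAGES_PER_BOOK = 2000    # Bible TOB (slide 13)
BOOK_THICKNESS_M = 0.12  # 12 cm per book
HOURS_PER_BOOK = 44      # reading time for one book


def memory_as_books(memory_gb: float) -> dict:
    """Express a memory size as a number of books, a stack height and a reading time."""
    pages = memory_gb * 2**30 / PAGE_BYTES
    books = pages / PAGES_PER_BOOK
    return {
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "years": books * HOURS_PER_BOOK / (24 * 365),
    }


if __name__ == "__main__":
    for label, gb in [("phone (4 GB)", 4), ("server (1500 GB)", 1500)]:
        r = memory_as_books(gb)
        print(f"{label}: {r['books']:.0f} books, "
              f"{r['stack_m']:.0f} m, {r['years']:.1f} years")
```

Under these assumptions the results land close to the rounded figures quoted on the slides (about 63 m and 2.6 years for the phone, about 24 km and roughly 1000 years for the server).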

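
The text clustering steps listed on slide 39 (build a k-nn graph, prune, compute connected components) can be sketched on a toy dataset. This is a minimal illustration, not the method from the presentation: the graph is built by brute force rather than with NN-Descent or NNCTPH, `difflib.SequenceMatcher` stands in for Jaro-Winkler, and pruning by a fixed similarity threshold is an assumption, since the slides do not say how pruning is done. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def build_knn_graph(items, k):
    """Brute-force k-nn graph: each node keeps edges to its k most similar nodes."""
    graph = {}
    for i, a in enumerate(items):
        scored = [(similarity(a, b), j) for j, b in enumerate(items) if j != i]
        scored.sort(reverse=True)  # most similar first
        graph[i] = scored[:k]
    return graph


def prune(graph, threshold):
    """Assumed pruning rule: drop edges whose similarity is below the threshold."""
    return {i: [(s, j) for s, j in nbrs if s >= threshold]
            for i, nbrs in graph.items()}


def connected_components(graph):
    """Connected components, treating every remaining edge as undirected."""
    adjacency = {i: set() for i in graph}
    for i, nbrs in graph.items():
        for _, j in nbrs:
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    messages = [
        "buy cheap viagra now",
        "buy cheap viagra today",
        "kobe bryant traded to clippers",
        "kobe bryant traded to lakers",
    ]
    graph = prune(build_knn_graph(messages, k=2), threshold=0.7)
    for cluster in connected_components(graph):
        print([messages[i] for i in cluster])
```

The brute-force build computes every pairwise similarity, which is exactly the cost the presentation argues against for large datasets; it is used here only to keep the example short.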