
Similarity search is an essential component of machine learning algorithms. However, performing efficient similarity search can be extremely challenging, especially if the dataset is distributed across multiple computers, and even more so if the similarity measure is not a metric. With the rise of Big Data processing, such challenging datasets are increasingly common. In this presentation we show how k-nearest neighbors (k-nn) graphs can be used to perform similarity search, clustering and anomaly detection.

Teaching assistant Operating systems, Distributed systems and Information security

- 1. Distributed k-nearest neighbors graph algorithms Thibault Debatty, Ir PhD 2019-12-03
- 2. k-nn graph: each node has an edge to its k most similar nodes
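
To make the definition concrete, here is a minimal brute-force sketch of a k-nn graph. It is not taken from the slides: the function name `brute_force_knn_graph`, the toy `closeness` similarity and the sample points are all assumptions chosen for illustration, and the quadratic loop is exactly the naive approach the rest of the deck tries to avoid.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def brute_force_knn_graph(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    k: int,
) -> Dict[int, List[Tuple[int, float]]]:
    """Map each item index to its k most similar other items (index, similarity)."""
    graph: Dict[int, List[Tuple[int, float]]] = {}
    for i, a in enumerate(items):
        # Compare item i with every other item: n * (n - 1) similarities in total.
        scored = [(similarity(a, b), j) for j, b in enumerate(items) if j != i]
        # Highest similarity first; ties are broken by the smaller index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [(j, s) for s, j in scored[:k]]
    return graph


def closeness(x: float, y: float) -> float:
    """Hypothetical similarity for numbers: 1 when equal, falling towards 0."""
    return 1.0 / (1.0 + abs(x - y))


points = [0.0, 1.0, 1.5, 10.0, 11.0]
graph = brute_force_knn_graph(points, closeness, k=2)
for node, edges in graph.items():
    print(node, [neighbor for neighbor, _ in edges])
```

Any similarity function can be plugged in, including one that is not a metric, because the sketch only ranks candidates by score.
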
- 3. Context: common tasks of machine learning, data mining, artificial intelligence or Big Data
  - Similarity search
  - Clustering
  - Anomaly detection
- 4. Context: similarity search
- 5. Context: similarity search. Is “High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v8g6” similar to a known SPAM?
- 6. Context: clustering. Kobe Bryant traded to Clippers No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9 Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj Nurses make Great Incomes Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3 Play here for summer fun Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69 Obtain details on your cred1t online. Get started today High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071 Japanese food discount Is your computer safe? Luxury at a Discount! Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5 Mutant fish sold at Connecticut market High quality JBL speakers Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl => Identify SPAM campaigns
- 7. Context: clustering. To analyze 300 rogue websites:
  - Cluster
  - Analyze 1 representative of each group
- 8. Context: anomaly detection. Find an infected computer on a network
- 9. Context: similarity search, clustering and anomaly detection are crucial for data processing!
- 10. Challenges: how hard can that be?
- 11. Challenges: computer memory is similar to a book
  - Accessible by address (page)
  - You have to read a page before you know its content (e.g. coordinates of a point)
- 12. Challenges: naive similarity search requires reading all pages
- 13. Challenges: how many pages? Bible TOB:
  - 2000 pages
  - Extra thin paper
  - 12 cm
  - 44 hours of reading
- 14. Challenges: Samsung Galaxy S9 (4 GB) is a stack of 63 m assuming 4 KB/page (Atomium = 102 m), 2.6 years of reading...
- 15. Challenges: our server
  - 1500 GB
  - 200,000 books
  - A stack of 24 km (Brussels to Louvain-la-Neuve = 26 km)
  - 1000 years of reading
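
The book figures on the last three slides can be reproduced with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 2^30 bytes, 4 KB = 4096 bytes) and non-stop reading over 365-day years; these conventions are not stated on the slides, but they give results close to the quoted numbers. The names `memory_as_books`, `phone` and `server` are illustrative.

```python
# Figures taken from the "Bible TOB" slide.
BOOK_PAGES = 2000
BOOK_THICKNESS_M = 0.12
BOOK_READING_HOURS = 44

# Assumptions (not stated on the slides): binary units, reading 24 hours a day.
PAGE_BYTES = 4 * 1024
GIB = 1024 ** 3
HOURS_PER_YEAR = 24 * 365


def memory_as_books(memory_bytes: int) -> dict:
    """Express a memory size as pages, books, stack height and reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "pages": pages,
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "reading_years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


phone = memory_as_books(4 * GIB)      # Samsung Galaxy S9 slide
server = memory_as_books(1500 * GIB)  # "our server" slide

for name, result in (("phone", phone), ("server", server)):
    print(
        f"{name}: {result['books']:.0f} books, "
        f"{result['stack_m']:.0f} m stack, "
        f"{result['reading_years']:.1f} years of reading"
    )
```

Under these assumptions the server comes out at 196,608 books, which the slide rounds to 200,000.
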
- 16. Challenges: even with modern hardware, naive algorithms are not an option
- 17. Indexes: divide space into “zones”. Example:
  - North: pages 1, 2, 3 and 4
  - South: pages 5, 6 and 7
- 18. Indexes: similarity search with an index. “query” is near zone “SOUTH” => read pages 5, 6 and 7
- 19. Indexes, limitations: similarity search with an index requires reading multiple zones (“curse of dimensionality”)
  - 1d: 2 zones
  - 2d: 4 zones
  - 3d: 8 zones
  - 8d: 256 zones
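
The zone counts listed here match 2^d. One way to read that, offered as an assumption since the slide only gives the numbers, is that each axis is cut into two halves and a query close to the cut must inspect both halves on every axis. The sketch below enumerates those combinations; `zones_touched` and `zone_count` are hypothetical helper names.

```python
from itertools import product
from typing import List, Tuple


def zones_touched(dimensions: int) -> List[Tuple[str, ...]]:
    """Enumerate every combination of (low, high) halves over all axes."""
    return list(product(("low", "high"), repeat=dimensions))


def zone_count(dimensions: int) -> int:
    """Closed form of the enumeration above: two choices per axis."""
    return 2 ** dimensions


for d in (1, 2, 3, 8):
    print(f"{d}d: {len(zones_touched(d))} zones")

# For a high dimensional dataset the enumeration is hopeless, so only the
# closed form is evaluated; the result is printed as a digit count.
print("4125d:", len(str(zone_count(4125))), "decimal digits")
```
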
- 20. Indexes, limitations: great for low dimensional Euclidean datasets (time). But what about
  - Higher dimensions? TV commercials: 4125 dimensions
  - Text?
- 21. k-nn graph: can we use a k-nn graph for analyzing large datasets?
- 22. k-nn graph: existing algorithms
  - Clustering
  - Similarity search (but slow)
- 23. Outline
  - Build from large text datasets
  - Fast similarity search
  - Add and remove points
  - Applications:
    - Text clustering
    - Detection of compromised computers
  - … using distributed processing!
- 24. Build from large text datasets
- 25. String similarity: but first… how to measure similarity between strings? Lots of literature, but no clean implementation!
  - Levenshtein
  - Damerau
  - Jaro-Winkler
  - N-Gram
  - Q-Gram
  - Cosine
  - Jaccard index
  - …
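
As an illustration of two of the measures named above, here is a small self-contained sketch of the Levenshtein edit distance and of the Jaccard index over character n-grams. These are textbook formulations, not the author's implementation; the function names and the default n-gram size of 3 are assumptions.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of all substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams are treated as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


known_spam = "High Qua1ityMedications Discount On All Reorders"
candidate = "High QualityMedications Discount On All Reorders"
print(levenshtein(known_spam, candidate))
print(round(jaccard(known_spam, candidate), 2))
```

The edit distance is a count where lower means more similar, while the Jaccard index is a similarity in [0, 1], so the two cannot be swapped without converting one into the other.
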
- 26. String similarity
- 27. String similarity
- 28. String similarity
- 29. String similarity
- 30. Building from text datasets
  - NN-Descent: build an approximate graph, compute O(n^1.14) similarities
  - BUT: iterative!
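
To show why NN-Descent is iterative, here is a deliberately simplified sketch of its core idea: start from random neighbors, then repeatedly compare the neighbors of each node with each other ("a neighbor of my neighbor is probably my neighbor") until nothing improves. It leaves out the sampling and bookkeeping of the published algorithm, assumes a symmetric similarity, and all names (`nn_descent`, `try_insert`, `closeness`) are illustrative.

```python
import random
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


def try_insert(current: Dict[int, float], candidate: int, score: float) -> int:
    """Replace the worst current neighbor if the candidate is better; return 1 on change."""
    if candidate in current:
        return 0
    worst = min(current, key=lambda node: current[node])
    if score > current[worst]:
        del current[worst]
        current[candidate] = score
        return 1
    return 0


def nn_descent(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    k: int,
    max_iterations: int = 10,
    seed: int = 0,
) -> Dict[int, Dict[int, float]]:
    """Approximate k-nn graph: node index -> {neighbor index: similarity}."""
    rng = random.Random(seed)
    n = len(items)
    # Start from k random neighbors per node (requires k <= n - 1).
    neighbors: Dict[int, Dict[int, float]] = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbors[i] = {j: similarity(items[i], items[j]) for j in rng.sample(others, k)}

    for _ in range(max_iterations):
        # Neighborhood of i = its neighbors plus the nodes that list i as a neighbor.
        local = {i: set(neighbors[i]) for i in range(n)}
        for i in range(n):
            for j in neighbors[i]:
                local[j].add(i)
        updates = 0
        for i in range(n):
            around = sorted(local[i])
            # Local join: every pair around i is a candidate edge.
            for position, a in enumerate(around):
                for b in around[position + 1:]:
                    score = similarity(items[a], items[b])
                    updates += try_insert(neighbors[a], b, score)
                    updates += try_insert(neighbors[b], a, score)
        if updates == 0:
            break  # converged: another pass would change nothing
    return neighbors


def closeness(x: float, y: float) -> float:
    """Hypothetical similarity for numbers."""
    return 1.0 / (1.0 + abs(x - y))


points = [float(v) for v in range(20)]
approx = nn_descent(points, closeness, k=3)
print({node: sorted(found) for node, found in approx.items()})
```

Each pass depends on the graph produced by the previous one, which is what makes the method awkward to run as a single distributed job.
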
- 31. Building from text datasets: NNCTPH
  - Hash using a modified hashing function (CTPH / ssdeep / spamsum)
  - Build subgraphs in parallel
  - Merge subgraphs
  - Single iteration!
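
The three steps (hash, build subgraphs, merge) can be sketched as follows. This is not NNCTPH itself: the slides do not give the modified hash, so `bucket_keys` is a stand-in that simply uses each word as a bucket key, `word_jaccard` replaces the string similarity, and a thread pool stands in for the distributed workers. All names are hypothetical.

```python
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Set, Tuple

Scored = List[Tuple[float, int]]  # (similarity, neighbor index)


def bucket_keys(text: str) -> Set[str]:
    """Stand-in for the modified CTPH hash: every lowercase word is a bucket key."""
    return set(text.lower().split())


def word_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard index over word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def build_subgraph(ids: List[int], items: Sequence[str],
                   similarity: Callable[[str, str], float], k: int) -> Dict[int, Scored]:
    """Brute-force k-nn graph restricted to the items of one bucket."""
    return {
        i: heapq.nlargest(k, [(similarity(items[i], items[j]), j) for j in ids if j != i])
        for i in ids
    }


def merge_subgraphs(subgraphs: List[Dict[int, Scored]], k: int) -> Dict[int, Scored]:
    """Pool the candidates found for each node in any bucket and keep the best k."""
    candidates: Dict[int, Dict[int, float]] = defaultdict(dict)
    for subgraph in subgraphs:
        for node, scored in subgraph.items():
            found = candidates[node]  # registers the node even with no neighbor
            for score, other in scored:
                found[other] = score
    return {
        node: heapq.nlargest(k, [(score, other) for other, score in found.items()])
        for node, found in candidates.items()
    }


def build_by_buckets(items: Sequence[str],
                     similarity: Callable[[str, str], float], k: int) -> Dict[int, Scored]:
    buckets: Dict[str, List[int]] = defaultdict(list)
    for index, item in enumerate(items):
        for key in bucket_keys(item):
            buckets[key].append(index)
    # Buckets are independent, so each subgraph can be built by a separate worker.
    with ThreadPoolExecutor() as pool:
        subgraphs = list(
            pool.map(lambda ids: build_subgraph(ids, items, similarity, k), buckets.values())
        )
    return merge_subgraphs(subgraphs, k)


subjects = [
    "cheap viagra best price",
    "cheap viagra best deal",
    "replica watches cheap",
    "replica watches swiss",
    "kobe bryant traded",
]
bucket_graph = build_by_buckets(subjects, word_jaccard, k=1)
print(bucket_graph)
```

Items that never share a bucket are never compared, which is where both the speed-up and the approximation come from: in this toy run the last subject ends up with no neighbor at all.
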
- 32. Building from text datasets
- 33. Building from text datasets: experimental evaluation
  - Apache Hadoop MapReduce
  - SPAM dataset
  - Jaro-Winkler string similarity (not a metric)
- 34. Building from text datasets
- 35. Fast similarity search. Add and remove points
- 36. Online building
  - Given a distributed graph:
    - Add nodes
    - Remove nodes
    - Search nearest neighbors of a query node
  - Requires k-medoids partitioning of the graph
- 37. Partitioning
  - k-medoids clustering
  - CLARANS is slow to converge
  - Two faster methods:
    - Inspired by Simulated Annealing
    - Heuristic
  - Impact of partitioning when we perform distributed search
- 38. Applications
- 39. Text clustering
  - Text dataset with Jaro-Winkler similarity (not a metric)
  - Steps:
    - Build (approximate) k-nn graph
    - Prune
    - Compute connected components
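
The prune and connected-components steps can be sketched as below, starting from a k-nn graph that is already built. The pruning rule (drop edges below a similarity threshold), the threshold value and the toy graph are assumptions for illustration; the slides do not say how pruning is done.

```python
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]  # node -> [(neighbor, similarity)]


def prune(graph: Graph, threshold: float) -> Graph:
    """Keep only the edges whose similarity reaches the threshold."""
    return {
        node: [(neighbor, score) for neighbor, score in edges if score >= threshold]
        for node, edges in graph.items()
    }


def connected_components(graph: Graph) -> List[Set[Hashable]]:
    """Group nodes that are linked by a path, ignoring edge direction."""
    adjacency: Dict[Hashable, Set[Hashable]] = {}
    for node, edges in graph.items():
        adjacency.setdefault(node, set())
        for neighbor, _ in edges:
            adjacency[node].add(neighbor)
            adjacency.setdefault(neighbor, set()).add(node)

    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in adjacency:
        if start in seen:
            continue
        component: Set[Hashable] = set()
        stack = [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy 2-nn graph: two tight pairs and one loosely attached node.
toy_graph: Graph = {
    "a": [("b", 0.9), ("c", 0.2)],
    "b": [("a", 0.9), ("c", 0.3)],
    "c": [("d", 0.8), ("b", 0.3)],
    "d": [("c", 0.8), ("a", 0.1)],
    "e": [("a", 0.1), ("d", 0.2)],
}
clusters = connected_components(prune(toy_graph, threshold=0.5))
print(clusters)
```

Without pruning, every node of a k-nn graph keeps k edges, so the whole toy graph would collapse into a single component; the threshold is what separates the clusters.
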
- 40. APT Detection
  - Advanced => no signatures
  - Persistent => limited activity
  - Threats
  - Need a C2 channel
- 41. APT Detection
- 42. APT Detection: here, APT relying on HTTP => proxy logs
- 43. APT Detection: how hard can that be?
- 44. APT Detection
- 45. APT Detection: displaying a page requires multiple HTTP requests => link each request to its parent using the logs from the proxy
- 46. APT Detection
- 47. APT Detection
- 48. APT Detection: weight is higher if
  - Requests are close in time
  - Requests belong to the same domain
  - Same sequence repeats
- 49. APT Detection: after pruning the weighted graph, the APT remains isolated!
- 50. APT Detection: weight is higher if
  - Requests are close in time
  - Requests belong to the same domain
  - Same sequence repeats
- 51. APT Detection
  - Batch: build graphs
  - Interactive (web interface):
    - Merge
    - Prune
    - Cluster
    - Filter
  - Approximate k-nn graph (time and memory)
- 52. APT Detection
- 53. APT Detection
  - Experimental evaluation
    - Proxy logs of real network
    - Simulated APT traffic
    - Rank suspicious domains
  - Results
    - High detection / false alarm ratio
    - Without prior knowledge about APT
- 54. APT Detection
  - False positives:
    - Content Delivery Networks (CDN)
    - Advertising domains
    - JavaScript library delivery
    - Websites with very few visits
  - => same behavior as APT
- 55. Conclusion: k-nn graph is an interesting tool to analyze large datasets, but
  - Only if approximation is acceptable
  - Other possibilities exist
- 56. Perspectives...
  - Broaden to other graph-like structures:
    - (Hierarchical) Small World Network graphs
    - Asymmetrical graphs
  - Broaden to other applications (clustering, nn search)
  - Predict the magnitude of approximation
- 57. Questions... Cyber Defence Lab www.cylab.be
