Distributed k-nearest
neighbors graph algorithms
Thibault Debatty, Ir PhD
2019-12-03
k-nn graph
Each node has an edge to its k most similar nodes
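As a minimal sketch of that definition, the graph can be built by brute force: compare every item with every other item and keep the k best. The function name `brute_force_knn_graph` and the toy `closeness` similarity are assumptions made for illustration, they do not come from the slides.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def brute_force_knn_graph(
    items: Sequence[T], k: int, similarity: Callable[[T, T], float]
) -> Dict[int, List[int]]:
    """Return, for each item index, the indices of its k most similar items."""
    graph: Dict[int, List[int]] = {}
    for i, item in enumerate(items):
        # Score every other item, then keep the k best (ties broken by index).
        scored = [
            (similarity(item, other), j)
            for j, other in enumerate(items)
            if j != i
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


def closeness(a: float, b: float) -> float:
    """Toy similarity for numbers: larger when the values are closer."""
    return -abs(a - b)


points = [0.0, 1.0, 2.0, 10.0, 11.0]
graph = brute_force_knn_graph(points, k=2, similarity=closeness)
```

This computes n(n-1) similarities, which is only practical for small datasets.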
Context
Common tasks of machine learning, data mining, Artificial Intelligence or Big Data:
● Similarity search
● Clustering
● Anomaly detection
Context : similarity search
“High Qua1ityMedications Discount
On All Reorders = Best Deal Ever!
Viagra50/100mg - $1.85 v8g6”
Similar to a known SPAM?
Context : clustering
Kobe Bryant traded to Clippers
No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
Nurses make Great Incomes
Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
Play here for summer fun
Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
Obtain details on your cred1t online. Get started today
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
Japanese food discount
Is your computer safe?
Luxury at a Discount!
Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
Mutant fish sold at Connecticut market
High quality JBL speakers
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
=> Identify SPAM campaigns
Context : clustering
To analyze 300 rogue websites:
● Cluster
● Analyze 1 representative of each group
Context : anomaly detection
Find infected computers on a network
Context
● Similarity search
● Clustering and
● Anomaly detection
… are crucial for data processing!
Challenges
How hard can that be?
Challenges
Computer memory is similar to a book
● Accessible by address (page)
● You have to read before you know the content (e.g. coordinates of a point)
Challenges
Naive similarity search requires reading all pages
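A minimal sketch of such a naive search, assuming the dataset fits in a Python list: every stored item is read once, whatever the query. The name `naive_search` and the read counter are illustrative additions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def naive_search(
    query: T, dataset: Sequence[T], k: int, similarity: Callable[[T, T], float]
) -> Tuple[List[int], int]:
    """Return the indices of the k most similar items and the number of reads."""
    reads = 0
    scored = []
    for idx, item in enumerate(dataset):
        reads += 1  # every stored item is read once, whatever the query
        scored.append((similarity(query, item), idx))
    # Best similarity first, ties broken by index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]], reads


result, reads = naive_search(
    4.0, [5.0, 1.0, 9.0, 4.5], k=2, similarity=lambda a, b: -abs(a - b)
)
```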
Challenges
How many pages?
Bible TOB:
● 2000 pages
● Extra thin paper
● 12 cm
● 44 hours of reading
Challenges
Samsung Galaxy S9 (4 GB)
A 63 m stack assuming 4 KB/page (the Atomium is 102 m tall), 2.6 years of reading...
Challenges
Our server
● 1500 GB
● 200,000 books
● A stack of 24 km
● 1000 years of reading
Brussels – Louvain-la-Neuve = 26 km
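The figures on these slides can be reproduced with a short calculation. The sketch below assumes that 1 GB means 2^30 bytes, that a page is 4 KB, and that reading goes on around the clock; these assumptions are mine, chosen because they match the slide values.

```python
PAGE_BYTES = 4 * 1024        # 4 KB per memory page, as stated on the slide
BOOK_PAGES = 2000            # pages in the reference book
BOOK_THICKNESS_M = 0.12      # 12 cm per book
BOOK_READING_HOURS = 44      # hours needed to read one book
HOURS_PER_YEAR = 24 * 365    # assumption: nonstop reading
GIB = 1024 ** 3              # assumption: 1 GB taken as 2**30 bytes


def memory_as_books(memory_bytes: int) -> dict:
    """Convert a memory size into books, stack height and reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


phone = memory_as_books(4 * GIB)      # about 63 m and 2.6 years
server = memory_as_books(1500 * GIB)  # about 197,000 books, 24 km, 990 years
```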
Challenges
Even with modern hardware, naive
algorithms are not an option
Indexes
Divide space into “zones”
Example:
● North: pages 1, 2, 3 and 4
● South: pages 5, 6 and 7
Indexes
Similarity search with index
“query” is near zone “SOUTH”
=> read pages 5, 6 and 7
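A minimal sketch of a two-zone index, assuming 2-D points and a hypothetical zoning rule that splits the plane at y = 0. The point coordinates are invented to mirror the slide (four points in the north, three in the south).

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def zone_of(point: Point) -> str:
    """Hypothetical zoning rule: split the plane at y = 0."""
    return "NORTH" if point[1] >= 0 else "SOUTH"


def build_index(points: List[Point]) -> Dict[str, List[int]]:
    """Map each zone name to the indices of the points it contains."""
    index: Dict[str, List[int]] = {}
    for idx, point in enumerate(points):
        index.setdefault(zone_of(point), []).append(idx)
    return index


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def search(
    query: Point, points: List[Point], index: Dict[str, List[int]]
) -> Tuple[Optional[int], int]:
    """Return the nearest point within the query's zone and the number of reads."""
    candidates = index.get(zone_of(query), [])
    best = min(
        candidates,
        key=lambda i: squared_distance(query, points[i]),
        default=None,
    )
    return best, len(candidates)


points = [(0, 3), (1, 2), (2, 4), (3, 1), (0, -1), (2, -3), (4, -2)]
index = build_index(points)
nearest, reads = search((2, -2), points, index)
```

Only the three southern points are read. A query lying close to the boundary may still have its true nearest neighbor in the other zone, which this sketch ignores.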
Indexes : limitations
Similarity search with index
Requires reading multiple zones:
1d : 2 zones
2d : 4 zones
3d : 8 zones
8d : 256 zones
“curse of dimensionality”
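The zone counts above follow 2^d. The sketch below assumes each axis is split in two and that the query sits near the boundary on every axis, so both halves must be read in each dimension.

```python
def zones_to_read(dimensions: int) -> int:
    """Zones touched by a query near the boundary on every axis (2 per axis)."""
    return 2 ** dimensions


counts = [zones_to_read(d) for d in (1, 2, 3, 8)]
# For a 4125-dimensional dataset the count has more than a thousand digits.
digits_for_4125 = len(str(zones_to_read(4125)))
```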
Indexes : limitations
Great for low-dimensional Euclidean datasets (time)
But what about
● Higher dimensions? TV commercials: 4125 dimensions
● Text?
k-nn graph
Can we use a k-nn graph for analyzing large datasets?
k-nn graph
Existing algorithms:
● Clustering
● Similarity search (but slow)
Outline
● Build from large text datasets
● Fast similarity search
● Add and remove points
● Applications:
– Text clustering
– Detection of compromised computers
● … using distributed processing!
Build from large text datasets
String similarity
But first… how to measure similarity between strings?
Lots of literature:
● Levenshtein
● Damerau
● Jaro-Winkler
● N-Gram
● Q-Gram
● Cosine
● Jaccard index
● …
But no clean implementation!
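Two of the listed measures are short enough to sketch directly: the Levenshtein edit distance and the Jaccard index computed on character n-grams. These are textbook formulations written for illustration, not the implementation referred to in the talk; the default n-gram size of 2 is an arbitrary choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def shingles(text: str, n: int = 2) -> set:
    """Set of character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index of the two n-gram sets: |intersection| / |union|."""
    set_a, set_b = shingles(a, n), shingles(b, n)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Levenshtein returns a distance (0 means identical) while Jaccard returns a similarity between 0 and 1, so the two are not directly interchangeable.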
Building from text datasets
● NN-Descent
Build an approximate graph
Compute O(n^1.14) similarities
● BUT: iterative!
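The iterative nature of NN-Descent can be seen in a simplified sketch: start from a random graph, then repeatedly test the neighbors of each node's neighbors (including reverse neighbors) as candidates, until an iteration brings no improvement. This version omits the sampling and incremental-search optimizations of the published algorithm; the function name, parameters and stopping rule are assumptions for illustration, and it requires more than k items.

```python
import random
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def nn_descent(
    items: Sequence[T],
    k: int,
    similarity: Callable[[T, T], float],
    max_iterations: int = 10,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Simplified NN-Descent: refine a random graph via neighbors of neighbors."""
    rng = random.Random(seed)
    n = len(items)
    # Start from a random graph: k random neighbors per node, with their scores.
    neighbors: Dict[int, Dict[int, float]] = {}
    for i in range(n):
        picks = rng.sample([j for j in range(n) if j != i], k)
        neighbors[i] = {j: similarity(items[i], items[j]) for j in picks}

    for _ in range(max_iterations):
        updates = 0
        # Reverse neighbors: nodes that currently point to each node.
        reverse = {i: set() for i in range(n)}
        for i, nbrs in neighbors.items():
            for j in nbrs:
                reverse[j].add(i)
        for i in range(n):
            local = set(neighbors[i]) | reverse[i]
            candidates = set()
            for j in local:
                candidates |= set(neighbors[j]) | reverse[j]
            candidates -= {i}
            candidates -= set(neighbors[i])
            for c in candidates:
                score = similarity(items[i], items[c])
                worst = min(neighbors[i], key=neighbors[i].get)
                if score > neighbors[i][worst]:
                    # Replace the current worst neighbor by the better candidate.
                    del neighbors[i][worst]
                    neighbors[i][c] = score
                    updates += 1
        if updates == 0:
            break  # the graph stopped improving
    return {
        i: sorted(nbrs, key=nbrs.get, reverse=True)
        for i, nbrs in neighbors.items()
    }


numbers = [float(i) for i in range(20)]
approx_graph = nn_descent(numbers, k=3, similarity=lambda a, b: -abs(a - b))
```

The result is approximate: some edges may differ from the exact k-nn graph.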
Building from text datasets
NNCTPH
● Hash using modified hashing function
CTPH / ssdeep / spamsum
● Build subgraphs in parallel
● Merge subgraphs
Single iteration!
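The three steps can be sketched as follows, under loud assumptions: `toy_hash` is a stand-in bucketing function and is NOT CTPH, `difflib.SequenceMatcher` replaces the string similarity used in the talk, and each text falls in exactly one bucket so that merging is a plain union. Texts are assumed to be distinct.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import Dict, List


def toy_hash(text: str) -> str:
    """Stand-in bucketing function (NOT CTPH): first three characters, lowercased."""
    return text[:3].lower()


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def build_subgraph(bucket: List[str], k: int) -> Dict[str, List[str]]:
    """Brute-force k-nn graph restricted to a single bucket."""
    graph = {}
    for item in bucket:
        others = [other for other in bucket if other != item]
        others.sort(key=lambda other: similarity(item, other), reverse=True)
        graph[item] = others[:k]
    return graph


def bucketed_knn_graph(texts: List[str], k: int) -> Dict[str, List[str]]:
    # Step 1: hash every text into a bucket.
    buckets = defaultdict(list)
    for text in texts:
        buckets[toy_hash(text)].append(text)
    # Step 2: build one subgraph per bucket, in parallel.
    with ThreadPoolExecutor() as pool:
        subgraphs = list(pool.map(lambda b: build_subgraph(b, k), buckets.values()))
    # Step 3: merge the subgraphs (buckets are disjoint here, so a union suffices).
    merged: Dict[str, List[str]] = {}
    for subgraph in subgraphs:
        merged.update(subgraph)
    return merged


texts = [
    "Discount on all orders 7z3",
    "Discount on all orders v69",
    "Discount on all orders 071",
    "Need the best price xt",
    "Need the best price 6xkp",
    "Japanese food discount",
]
text_graph = bucketed_knn_graph(texts, k=2)
```

Items are only compared within their bucket, so a text alone in its bucket gets no neighbors; the quality of the graph depends entirely on how well the hash groups similar texts.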
Building from text datasets
● Experimental evaluation:
– Apache Hadoop MapReduce
– SPAM dataset
– Jaro-Winkler string similarity (not a metric)
Fast similarity search
Add and remove points
Online building
● Given a distributed graph:
– Add nodes
– Remove nodes
– Search nearest neighbors of query node
● Requires k-medoids partitioning of the graph
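To illustrate how a k-nn graph can serve a nearest-neighbor query, here is a generic greedy search on a single machine: start at some node and keep moving to the neighbor most similar to the query until no neighbor improves. This is a common textbook strategy, not the distributed procedure of the talk; `greedy_search` and the chain-shaped toy graph are assumptions.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_search(
    query: T,
    items: Sequence[T],
    graph: Dict[int, List[int]],
    similarity: Callable[[T, T], float],
    start: int = 0,
) -> int:
    """Hill-climb along graph edges toward the item most similar to the query."""
    current = start
    current_score = similarity(query, items[current])
    while True:
        best, best_score = current, current_score
        for neighbor in graph[current]:
            score = similarity(query, items[neighbor])
            if score > best_score:
                best, best_score = neighbor, score
        if best == current:
            return current  # no neighbor is closer: local optimum reached
        current, current_score = best, best_score


values = [float(i) for i in range(10)]
# Toy graph: each node is linked to its immediate predecessor and successor.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
found = greedy_search(6.2, values, chain, lambda a, b: -abs(a - b))
```

Greedy search can stop at a local optimum on real graphs, so practical systems usually restart from several nodes or keep a candidate list.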
Partitioning
● k-medoids clustering
● CLARANS is slow to converge
● Two faster methods:
– Inspired by Simulated Annealing
– Heuristic
● Impact of partitioning when we perform distributed search
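For readers unfamiliar with k-medoids, the sketch below shows the plain alternating scheme (assign each point to its nearest medoid, then re-pick each cluster's medoid). It is neither CLARANS nor one of the two faster methods mentioned above; the naive initialization with the first k points and the assumption of distinct points are mine.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def assign(
    points: Sequence[T], medoids: List[int], distance: Callable[[T, T], float]
) -> Dict[int, List[int]]:
    """Attach every point index to its nearest medoid index."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i, point in enumerate(points):
        nearest = min(medoids, key=lambda m: distance(point, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(
    points: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
    max_iterations: int = 100,
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))  # naive initialization: the first k points
    for _ in range(max_iterations):
        clusters = assign(points, medoids, distance)
        # New medoid of a cluster: the member with the smallest total distance.
        new_medoids = [
            min(
                members,
                key=lambda c: sum(distance(points[c], points[o]) for o in members),
            )
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)


data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
medoids, clusters = k_medoids(data, k=2, distance=lambda a, b: abs(a - b))
```

Unlike k-means, the cluster centers are actual data points, so only a pairwise distance is needed and no averaging of items.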
Applications
Text clustering
● Text dataset with Jaro-Winkler similarity (not a metric)
● Steps:
– Build (approximate) k-nn graph
– Prune
– Compute connected components
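The last two steps can be sketched as below, assuming the graph is stored as a dictionary mapping each node to a list of (neighbor, similarity) pairs, that pruning means dropping edges under a similarity threshold, and that edges are treated as undirected when computing components. The threshold value and the toy graph are invented for the example.

```python
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def prune(graph: Graph, threshold: float) -> Graph:
    """Drop every edge whose similarity is below the threshold."""
    return {
        node: [(nbr, sim) for nbr, sim in edges if sim >= threshold]
        for node, edges in graph.items()
    }


def connected_components(graph: Graph) -> List[Set[Hashable]]:
    """Group nodes that remain linked, treating edges as undirected."""
    adjacency: Dict[Hashable, Set[Hashable]] = {node: set() for node in graph}
    for node, edges in graph.items():
        for nbr, _ in edges:
            adjacency[node].add(nbr)
            adjacency.setdefault(nbr, set()).add(node)
    seen: Set[Hashable] = set()
    components = []
    for node in adjacency:
        if node in seen:
            continue
        # Depth-first traversal from an unvisited node.
        component: Set[Hashable] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


knn = {
    "a": [("b", 0.9), ("c", 0.2)],
    "b": [("a", 0.9), ("c", 0.3)],
    "c": [("d", 0.8), ("b", 0.3)],
    "d": [("c", 0.8), ("a", 0.1)],
}
groups = connected_components(prune(knn, threshold=0.5))
```

Each resulting component is one cluster; raising the threshold splits the graph into more and smaller clusters.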
APT Detection
● Advanced => no signatures
● Persistent => limited activity
● Threats
● Need a C2 channel
APT Detection
Here:
APT relying on HTTP
=> proxy logs
APT Detection
How hard can that be?
APT Detection
Displaying a page requires multiple
HTTP requests
=> link each request to its parent
using the logs from the proxy
APT Detection
Weight is higher if:
● Requests are close in time
● Requests belong to the same domain
● Same sequence repeats
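The slides do not give the weighting formula, so the sketch below is purely hypothetical: it only shows one way to combine the three criteria so that each of them raises the weight. The `Request` type, the `edge_weight` function, the `time_scale` parameter, the additive combination and the use of the URL hostname as the "domain" are all assumptions.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Request:
    timestamp: float  # seconds
    url: str


def edge_weight(
    parent: Request, child: Request, repeat_count: int, time_scale: float = 1.0
) -> float:
    """Hypothetical weight of the edge linking a request to a candidate parent."""
    # Criterion 1: requests close in time score near 1, distant ones near 0.
    delay = abs(child.timestamp - parent.timestamp)
    time_score = 1.0 / (1.0 + delay / time_scale)
    # Criterion 2: same domain (approximated here by the URL hostname).
    same_domain = urlsplit(parent.url).hostname == urlsplit(child.url).hostname
    domain_score = 1.0 if same_domain else 0.0
    # Criterion 3: grows toward 1 as the same sequence is seen more often.
    sequence_score = 1.0 - 1.0 / (1.0 + repeat_count)
    return time_score + domain_score + sequence_score


page = Request(0.0, "http://example.com/index.html")
style = Request(0.1, "http://example.com/style.css")
late = Request(5.0, "http://example.com/style.css")
other = Request(0.1, "http://cdn.example.net/lib.js")
```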
APT Detection
After pruning the weighted graph,
the APT remains isolated!
APT Detection
● Batch: build graphs
● Interactive (web interface):
– Merge
– Prune
– Cluster
– Filter
● Approximate k-nn graph (time and memory)
APT Detection
● Experimental evaluation
– Proxy logs of real network
– Simulated APT traffic
– Rank suspicious domains
● Results
– High detection / false alarm ratio
– Without prior knowledge about APT
APT Detection
● False positives:
– Content Delivery Networks (CDN)
– Advertising domains
– JavaScript library delivery
– Websites with very few visits
=> same behavior as APT
Conclusion
The k-nn graph is an interesting tool for analyzing large datasets, but
● Only if approximation is acceptable
● Other possibilities exist
Perspectives...
● Broaden to other graph-like structures:
– (Hierarchical) Small World Network graphs
– Asymmetrical graphs
● Broaden to other applications (clustering, nn search)
● Predict the magnitude of approximation
Questions...
Cyber Defence Lab
www.cylab.be

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 

An introduction to similarity search and k-nn graphs

  • 1. Distributed k-nearest neighbors graph algorithms Thibault Debatty, Ir PhD 2019-12-03
  • 2. k-nn graph
       Each node has an edge to its k most similar nodes
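To make the definition on the slide above concrete, here is a minimal brute-force sketch. The function name `brute_force_knn_graph` and the toy one-dimensional data are illustrative assumptions, not taken from the slides.

```python
import heapq


def brute_force_knn_graph(items, similarity, k):
    """Return, for each item index, its k most similar other items.

    Computes all n * (n - 1) similarities, so it only suits small inputs.
    """
    graph = {}
    for i, a in enumerate(items):
        candidates = (
            (j, similarity(a, b)) for j, b in enumerate(items) if j != i
        )
        # Keep the k candidates with the highest similarity.
        graph[i] = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return graph


# Toy example (assumption): 1-D points, similarity decreases with distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
graph = brute_force_knn_graph(points, lambda a, b: -abs(a - b), k=2)
```

Each entry of `graph` is a list of `(neighbor index, similarity)` pairs, best first. The quadratic cost of this approach is the reason approximate builders come up later in the talk.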
  • 3. Context
       Common tasks of machine learning, data mining, artificial intelligence, or big data:
       ● Similarity search
       ● Clustering
       ● Anomaly detection
  • 4. Context: similarity search
  • 5. Context: similarity search
       “High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v8g6”
       Similar to a known SPAM?
  • 6. Context: clustering
       Kobe Bryant traded to Clippers
       No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
       Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
       Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
       Nurses make Great Incomes
       Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
       High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
       Play here for summer fun
       Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
       High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
       Obtain details on your cred1t online. Get started today
       High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
       Japanese food discount
       Is your computer safe?
       Luxury at a Discount!
       Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
       Mutant fish sold at Connecticut market
       High quality JBL speakers
       Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
       High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
       => Identify SPAM campaigns
  • 7. Context: clustering
       To analyze 300 rogue websites:
       ● Cluster
       ● Analyze 1 representative of each group
  • 8. Context: anomaly detection
       Find an infected computer on a network
  • 9. Context
       ● Similarity search
       ● Clustering and
       ● Anomaly detection
       … are crucial for data processing!
  • 10. Challenges
       How hard can that be?
  • 11. Challenges
       Computer memory is similar to a book:
       ● Accessible by address (page)
       ● You have to read a page before you know its content (e.g. the coordinates of a point)
  • 12. Challenges
       Naive similarity search requires reading all pages
  • 13. Challenges
       How many pages? Bible TOB:
       ● 2000 pages
       ● Extra thin paper
       ● 12 cm
       ● 44 hours of reading
  • 14. Challenges
       Samsung Galaxy S9 (4 GB): a 63 m stack assuming 4 KB/page (the Atomium is 102 m), 2.6 years of reading...
  • 15. Challenges
       Our server:
       ● 1500 GB
       ● 200,000 books
       ● A stack of 24 km (Brussels to Louvain-la-Neuve is 26 km)
       ● 1000 years of reading
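The figures on the three slides above can be re-derived with a few lines of arithmetic. This is a sketch under two assumptions that the slides do not state: 1 GB is read as 2**30 bytes, and "years of reading" means reading non-stop, 24 hours a day.

```python
PAGE_BYTES = 4 * 1024        # 4 KB per "page", as assumed on the slide
BOOK_PAGES = 2000            # Bible TOB
BOOK_THICKNESS_M = 0.12      # 12 cm
BOOK_READING_HOURS = 44
HOURS_PER_YEAR = 24 * 365    # assumption: non-stop reading


def memory_as_books(memory_bytes):
    """Express a memory size as a stack of 2000-page books."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "reading_years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


GIB = 1024 ** 3  # assumption: 1 GB is read as 2**30 bytes
phone = memory_as_books(4 * GIB)      # about 63 m and 2.6 years
server = memory_as_books(1500 * GIB)  # about 197,000 books, 24 km, 990 years
```

Under these assumptions the results round to the values quoted on the slides (63 m and 2.6 years for the phone; roughly 200,000 books, 24 km and 1000 years for the server).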
  • 16. Challenges
       Even with modern hardware, naive algorithms are not an option
  • 17. Indexes
       Divide space into “zones”. Example:
       ● North: pages 1, 2, 3 and 4
       ● South: pages 5, 6 and 7
  • 18. Indexes
       Similarity search with an index: the “query” is near zone “SOUTH” => read pages 5, 6 and 7
  • 19. Indexes: limitations
       Similarity search with an index requires reading multiple zones:
       ● 1d: 2 zones
       ● 2d: 4 zones
       ● 3d: 8 zones
       ● 8d: 256 zones
       => “curse of dimensionality”
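The zone counts on the slide above follow a power law: if every axis is cut in two, a query close to the cut on every axis may have to inspect 2^d zones. A small sketch (the function name `zones_to_read` and the `splits_per_axis` parameter are illustrative assumptions):

```python
def zones_to_read(dimensions, splits_per_axis=2):
    """Worst-case number of zones touching a query that sits on every cut."""
    return splits_per_axis ** dimensions


for d in (1, 2, 3, 8):
    print(f"{d}d: {zones_to_read(d)} zones")
```

The loop prints the four values listed on the slide (2, 4, 8 and 256 zones).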
  • 20. Indexes: limitations
       Great for low-dimensional Euclidean datasets (time). But what about:
       ● Higher dimensions? TV commercials: 4125 dimensions
       ● Text?
  • 21. k-nn graph
       Can we use a k-nn graph for analyzing large datasets?
  • 22. k-nn graph
       Existing algorithms:
       ● Clustering
       ● Similarity search (but slow)
  • 23. Outline
       ● Build from large text datasets
       ● Fast similarity search
       ● Add and remove points
       ● Applications:
         – Text clustering
         – Detection of compromised computers
       ● … using distributed processing!
  • 24. Build from large text datasets
  • 25. String similarity
       But first… how to measure similarity between strings? Lots of literature:
       ● Levenshtein
       ● Damerau
       ● Jaro-Winkler
       ● N-Gram
       ● Q-Gram
       ● Cosine
       ● Jaccard index
       ● …
       But no clean implementation!
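Two of the measures listed on the slide above are short enough to sketch directly: the Levenshtein edit distance and the Jaccard index computed on character n-grams. These are textbook formulations written for illustration; the helper names and the default n-gram size of 2 are assumptions, not the implementation used in the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ngrams(s, n=2):
    """Set of all character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_index(a, b, n=2):
    """Shared n-grams divided by all n-grams of both strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For instance, `levenshtein("Quality", "Qua1ity")` is 1, which is why an obfuscated spam subject stays close to its original under an edit-distance measure.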
  • 26. String similarity
  • 27. String similarity
  • 28. String similarity
  • 29. String similarity
  • 30. Building from text datasets
       ● NN-Descent
         – Build an approximate graph
         – Compute O(n^1.14) similarities
       ● BUT: iterative!
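The iterative nature of NN-Descent mentioned on the slide above is easier to see in code. The following is a simplified single-machine sketch of the core idea (a neighbor of a neighbor is likely to be a neighbor too); it omits the sampling and incremental-search optimizations of the published algorithm, assumes a symmetric similarity, and requires k < n. The function and parameter names are illustrative assumptions.

```python
import random


def nn_descent(items, similarity, k, max_iterations=10, seed=0):
    """Approximate k-nn graph: one list of k neighbor indices per item."""
    rng = random.Random(seed)
    n = len(items)
    # neighbors[i] maps neighbor index -> similarity; start from random edges.
    neighbors = []
    for i in range(n):
        others = rng.sample([j for j in range(n) if j != i], k)
        neighbors.append({j: similarity(items[i], items[j]) for j in others})

    def try_insert(i, j, sim):
        """Put j in i's list if it beats i's current worst neighbor."""
        current = neighbors[i]
        if j == i or j in current:
            return False
        worst = min(current, key=current.get)
        if sim <= current[worst]:
            return False
        del current[worst]
        current[j] = sim
        return True

    for _ in range(max_iterations):
        # Local neighborhood of a node: its neighbors plus its reverse neighbors.
        local = [set(nb) for nb in neighbors]
        for i, nb in enumerate(neighbors):
            for j in nb:
                local[j].add(i)
        updates = 0
        for group in local:
            members = sorted(group)
            # Compare every pair inside the local neighborhood.
            for x in range(len(members)):
                for y in range(x + 1, len(members)):
                    a, b = members[x], members[y]
                    sim = similarity(items[a], items[b])
                    updates += try_insert(a, b, sim)
                    updates += try_insert(b, a, sim)
        if updates == 0:  # converged: another pass would change nothing
            break
    return [sorted(nb, key=nb.get, reverse=True) for nb in neighbors]


# Toy usage (assumption): random 1-D points.
point_rng = random.Random(1)
points = [point_rng.random() for _ in range(30)]
graph = nn_descent(points, lambda a, b: -abs(a - b), k=3)
```

Each pass depends on the graph produced by the previous one, which is what makes the method awkward to run as a single distributed job.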
  • 31. Building from text datasets
       NNCTPH:
       ● Hash using a modified hashing function (CTPH / ssdeep / spamsum)
       ● Build subgraphs in parallel
       ● Merge subgraphs
       Single iteration!
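The three steps on the slide above (hash, build subgraphs in parallel, merge) can be sketched as follows. This is not NNCTPH itself: the `bucket_keys` callback is a hypothetical stand-in for the modified CTPH hash, whose details are not given in the slides, and a thread pool stands in for distributed workers.

```python
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_subgraph(bucket, items, similarity, k):
    """Brute-force k-nn graph restricted to the item indices of one bucket."""
    sub = {}
    for i in bucket:
        scored = [(similarity(items[i], items[j]), j) for j in bucket if j != i]
        sub[i] = heapq.nlargest(k, scored)
    return sub


def merge_subgraphs(subgraphs, k):
    """Union the candidate lists of each node and keep its k best neighbors."""
    candidates = defaultdict(dict)
    for sub in subgraphs:
        for i, scored in sub.items():
            for sim, j in scored:
                candidates[i][j] = sim
    return {
        i: [j for j, _ in heapq.nlargest(k, found.items(), key=lambda p: p[1])]
        for i, found in candidates.items()
    }


def bucketed_knn_graph(items, bucket_keys, similarity, k):
    # Step 1: hash every item into one or more buckets.
    buckets = defaultdict(list)
    for index, item in enumerate(items):
        for key in bucket_keys(item):
            buckets[key].append(index)
    # Step 2: build one subgraph per bucket, independently of the others.
    with ThreadPoolExecutor() as pool:
        subgraphs = list(pool.map(
            lambda bucket: build_subgraph(bucket, items, similarity, k),
            buckets.values(),
        ))
    # Step 3: merge the subgraphs into a single graph.
    return merge_subgraphs(subgraphs, k)


# Toy usage (assumption): bucket words by first and last letter.
words = ["spam", "span", "scan", "ham", "hat"]
graph = bucketed_knn_graph(
    words,
    bucket_keys=lambda s: {s[0], s[-1]},
    similarity=lambda a, b: sum(x == y for x, y in zip(a, b)),
    k=1,
)
```

Because every bucket is processed once and independently, the whole build is a single pass, in contrast with an iterative method.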
  • 32. Building from text datasets
  • 33. Building from text datasets
       Experimental evaluation:
       – Apache Hadoop MapReduce
       – SPAM dataset
       – Jaro-Winkler string similarity (not a metric)
  • 34. Building from text datasets
  • 35. Fast similarity search; add and remove points
  • 36. Online building
       ● Given a distributed graph:
         – Add nodes
         – Remove nodes
         – Search nearest neighbors of query node
       ● Requires k-medoids partitioning of graph
  • 37. Partitioning
       ● k-medoids clustering
       ● CLARANS is slow to converge
       ● Two faster methods:
         – Inspired by Simulated Annealing
         – Heuristic
       ● Impact of partitioning when we perform distributed search
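For readers unfamiliar with k-medoids, the sketch below shows the textbook alternating heuristic (assign each item to its closest medoid, then re-pick each medoid). It is neither CLARANS nor either of the two faster methods mentioned on the slide above, whose details are not given here; the function name and toy data are assumptions.

```python
import random


def k_medoids(items, distance, k, max_iterations=20, seed=0):
    """Return a dict mapping each medoid index to the indices of its cluster."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)
    clusters = {}
    for _ in range(max_iterations):
        # Assignment step: attach every item to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            if i in clusters:  # a medoid always stays in its own cluster
                closest = i
            else:
                closest = min(medoids, key=lambda m: distance(item, items[m]))
            clusters[closest].append(i)
        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


# Toy usage (assumption): two well-separated groups of 1-D points.
points = [0, 1, 2, 10, 11, 12]
clusters = k_medoids(points, lambda a, b: abs(a - b), k=2)
```

Unlike k-means, the cluster centers are actual data items, so only a pairwise distance (or similarity) is needed, which suits text data.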
  • 38. Applications
  • 39. Text clustering
       ● Text dataset with Jaro-Winkler similarity (not a metric)
       ● Steps:
         – Build (approximate) k-nn graph
         – Prune
         – Compute connected components
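The last two steps on the slide above can be sketched as follows, starting from an already built k-nn graph. The pruning rule (drop edges below a fixed similarity threshold) is an assumption made for illustration; the slides only say "Prune".

```python
def prune(graph, threshold):
    """Drop edges whose similarity is below the threshold (assumed rule)."""
    return {
        node: [(nb, sim) for nb, sim in edges if sim >= threshold]
        for node, edges in graph.items()
    }


def connected_components(graph):
    """Clusters are the connected components, treating edges as undirected."""
    adjacency = {node: set() for node in graph}
    for node, edges in graph.items():
        for nb, _ in edges:
            adjacency[node].add(nb)
            adjacency.setdefault(nb, set()).add(node)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in adjacency[node] - seen:
                seen.add(nb)
                stack.append(nb)
        components.append(sorted(component))
    return components


# Toy k-nn graph (assumption): node -> list of (neighbor, similarity).
knn = {
    "a": [("b", 0.9), ("c", 0.2)],
    "b": [("a", 0.9), ("c", 0.3)],
    "c": [("b", 0.3), ("d", 0.8)],
    "d": [("c", 0.8), ("a", 0.1)],
}
clusters = connected_components(prune(knn, threshold=0.5))
```

With the 0.5 threshold only the a-b and c-d edges survive, so the toy graph splits into two clusters.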
  • 40. APT Detection
       ● Advanced => no signatures
       ● Persistent => limited activity
       ● Threats
       ● Need a C2 channel
  • 41. APT Detection
  • 42. APT Detection
       Here: APT relying on HTTP => proxy logs
  • 43. APT Detection
       How hard can that be?
  • 44. APT Detection
  • 45. APT Detection
       Displaying a page requires multiple HTTP requests => link each request to its parent using the logs from the proxy
  • 46. APT Detection
  • 47. APT Detection
  • 48. APT Detection
       Weight is higher if:
       ● Requests are close in time
       ● Requests belong to the same domain
       ● Same sequence repeats
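The slide above gives only the direction in which each criterion moves the weight, not a formula. The sketch below is a purely hypothetical scoring function that respects those three directions; the `Request` fields, the exponential decay, the `time_scale` and `same_domain_bonus` parameters are all assumptions, not the weighting used in the talk.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class Request:
    timestamp: float  # seconds (hypothetical field)
    domain: str       # requested domain (hypothetical field)


def edge_weight(parent, child, repeats=1, time_scale=1.0, same_domain_bonus=2.0):
    """Hypothetical weight of the edge linking a request to a candidate parent."""
    # Closer in time => weight nearer to 1, decaying with the time gap.
    closeness = exp(-abs(child.timestamp - parent.timestamp) / time_scale)
    # Same domain => multiplied by a bonus factor.
    domain_factor = same_domain_bonus if parent.domain == child.domain else 1.0
    # The same parent/child sequence seen several times => proportionally heavier.
    return closeness * domain_factor * repeats


page = Request(timestamp=0.0, domain="example.org")
image = Request(timestamp=1.0, domain="example.org")
late = Request(timestamp=5.0, domain="example.org")
other = Request(timestamp=1.0, domain="example.net")
```

With these toy requests, `edge_weight(page, image)` exceeds both `edge_weight(page, late)` and `edge_weight(page, other)`, and raising `repeats` increases it further.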
  • 49. APT Detection
       After pruning the weighted graph, the APT remains isolated!
  • 50. APT Detection
       Weight is higher if:
       ● Requests are close in time
       ● Requests belong to the same domain
       ● Same sequence repeats
  • 51. APT Detection
       ● Batch: build graphs
       ● Interactive (web interface):
         – Merge
         – Prune
         – Cluster
         – Filter
       ● Approximate k-nn graph (time and memory)
  • 52. APT Detection
  • 53. APT Detection
       ● Experimental evaluation
         – Proxy logs of real network
         – Simulated APT traffic
         – Rank suspicious domains
       ● Results
         – High detection / false alarm ratio
         – Without prior knowledge about APT
  • 54. APT Detection
       ● False positives:
         – Content Delivery Networks (CDN)
         – Advertising domains
         – Javascript library delivery
         – Websites with very few visits
       => same behavior as APT
  • 55. Conclusion
       k-nn graph is an interesting tool to analyze large datasets, but
       ● Only if approximation is acceptable
       ● Other possibilities exist
  • 56. Perspectives...
       ● Broaden to other graph-like structures:
         – (Hierarchical) Small World Network graphs
         – Asymmetrical graphs
       ● Broaden to other applications (clustering, nn search)
       ● Predict the magnitude of approximation
  • 57. Questions...
       Cyber Defence Lab
       www.cylab.be