Distributed k-nearest
neighbors graph algorithms
Thibault Debatty, Ir PhD
2019-12-03

k-nn graph
Each node has an edge to its k most similar nodes
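A minimal sketch of this definition, not taken from the slides: the brute-force construction below links every node to its k most similar nodes. The `inverse_distance` similarity and the toy one-dimensional points are assumptions chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def brute_force_knn_graph(
    nodes: Sequence[float],
    k: int,
    similarity: Callable[[float, float], float],
) -> Dict[int, List[Tuple[int, float]]]:
    """Return {node index: [(neighbor index, similarity), ...]}, k edges per node."""
    graph = {}
    for i, a in enumerate(nodes):
        # Score every other node, then keep the k most similar ones.
        scored = [(j, similarity(a, b)) for j, b in enumerate(nodes) if j != i]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = scored[:k]
    return graph


def inverse_distance(a: float, b: float) -> float:
    # Hypothetical similarity: 1 for identical values, close to 0 when far apart.
    return 1.0 / (1.0 + abs(a - b))


points = [0.0, 1.0, 1.5, 10.0]
knn = brute_force_knn_graph(points, k=2, similarity=inverse_distance)
```

This version computes n(n-1) similarities, which is only practical for small datasets.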

Context
Common tasks in machine learning, data mining, artificial intelligence or big data:
● Similarity search
● Clustering
● Anomaly detection

Context: similarity search
“High Qua1ityMedications Discount
On All Reorders = Best Deal Ever!
Viagra50/100mg - $1.85 v8g6”
Similar to a known SPAM?

Context: clustering
Kobe Bryant traded to Clippers
No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
Nurses make Great Incomes
Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
Play here for summer fun
Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
Obtain details on your cred1t online. Get started today
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
Japanese food discount
Is your computer safe?
Luxury at a Discount!
Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
Mutant fish sold at Connecticut market
High quality JBL speakers
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
=> Identify SPAM campaigns

Context: clustering
To analyze 300 rogue websites:
● Cluster
● Analyze one representative of each group

Context: anomaly detection
Find an infected computer on a network

Context
● Similarity search
● Clustering and
● Anomaly detection
… are crucial for data processing!

Challenges
How hard can that be?

Challenges
Computer memory is similar to a book:
● Accessible by address (page)
● You have to read a page before you know its content (e.g. the coordinates of a point)

Challenges
Naive similarity search requires reading all pages
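As an illustration of the naive approach, the sketch below scans every stored point to answer a single query. The two-dimensional points and the Euclidean distance are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def naive_nearest(points: Sequence[Point], query: Point, k: int = 1) -> List[int]:
    """Return the indices of the k points closest to query, reading every point once."""
    distances = [(math.dist(p, query), i) for i, p in enumerate(points)]
    distances.sort()
    return [i for _, i in distances[:k]]


points = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 0.0)]
nearest = naive_nearest(points, query=(0.9, 1.2), k=2)
```

Every query touches all stored points, so the cost grows linearly with the size of the dataset.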

Challenges
How many pages?
Bible TOB:
● 2000 pages
● Extra thin paper
● 12 cm thick
● 44 hours of reading

Challenges
Samsung Galaxy S9 (4 GB)
A 63 m stack assuming 4 KB per page (Atomium = 102 m), 2.6 years of reading...

Challenges
Our server
● 1500 GB
● 200,000 books
● A stack of 24 km
● 1000 years of reading
Brussels – Louvain-la-Neuve = 26 km
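The figures above can be checked with a short calculation. This sketch assumes that "GB" means 2**30 bytes and that a year of reading is 24 × 365 hours; the other constants come from the slides.

```python
PAGE_BYTES = 4 * 1024        # 4 KB per page, as stated on the slide
BOOK_PAGES = 2000            # Bible TOB
BOOK_THICKNESS_M = 0.12      # 12 cm
BOOK_READING_HOURS = 44
HOURS_PER_YEAR = 24 * 365    # assumption: reading non-stop
GIB = 1024 ** 3              # assumption: "GB" read as 2**30 bytes


def memory_as_books(memory_bytes: int) -> dict:
    """Express a memory size as a number of books, a stack height and a reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "reading_years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


phone = memory_as_books(4 * GIB)
server = memory_as_books(1500 * GIB)
```

Under these assumptions the phone comes out at about 63 m and 2.6 years, and the server at roughly 197,000 books, 23.6 km and just under 1000 years, which agrees with the rounded values on the slides.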

Challenges
Even with modern hardware, naive algorithms are not an option

Indexes
Divide space into “zones”
Example:
● North: pages 1, 2, 3 and 4
● South: pages 5, 6 and 7

Indexes
Similarity search with index
“query” is near zone “SOUTH”
=> read pages 5, 6 and 7
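A minimal sketch of the zone idea, assuming two zones separated by the horizontal axis. The `zone_of` rule and the sample points are hypothetical; the slides do not say how the zones are defined.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def zone_of(point: Point) -> str:
    # Hypothetical rule: the horizontal axis splits space into two zones.
    return "NORTH" if point[1] >= 0 else "SOUTH"


def build_index(points: List[Point]) -> Dict[str, List[Point]]:
    """Group the points by zone."""
    index = defaultdict(list)
    for p in points:
        index[zone_of(p)].append(p)
    return index


def search(index: Dict[str, List[Point]], query: Point) -> Tuple[Point, int]:
    """Return the nearest point inside the query's zone and how many points were read."""
    candidates = index[zone_of(query)]
    best = min(candidates, key=lambda p: math.dist(p, query))
    return best, len(candidates)


points = [(1, 4), (2, 3), (-3, 2), (4, 1), (1, -2), (-2, -3), (3, -4)]
index = build_index(points)
best, reads = search(index, query=(2, -2.5))
```

Here the query falls in the SOUTH zone, so only three of the seven points are read. A query close to a zone border may have its true nearest neighbor in another zone, so more than one zone may have to be read.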

Indexes: limitations
Similarity search with index
Requires reading multiple zones:
1d: 2 zones
2d: 4 zones
3d: 8 zones
8d: 256 zones
“curse of dimensionality”
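The zone counts above follow 2^d. As a sketch, assuming a regular grid and a query sitting at the corner of a cell, the function below enumerates the cells that touch that corner. The grid layout is an assumption used only to make the count concrete.

```python
from itertools import product
from typing import List, Tuple


def zones_to_read(cell: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Cells that touch the upper corner of `cell` in a regular grid."""
    return [
        tuple(c + o for c, o in zip(cell, offsets))
        for offsets in product((0, 1), repeat=len(cell))
    ]


# Number of zones to read for the dimensions listed on the slide.
counts = {d: len(zones_to_read((0,) * d)) for d in (1, 2, 3, 8)}
```

Each extra dimension doubles the number of zones, so the benefit of the index shrinks quickly as the dimension grows.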

Indexes: limitations
Great for low-dimensional Euclidean datasets (time)
But what about
● Higher dimensions? TV commercials: 4125 dimensions
● Text?

k-nn graph
Can we use a k-nn graph for analyzing large datasets?

k-nn graph
Existing algorithms:
● Clustering
● Similarity search (but slow)

Outline
● Build from large text datasets
● Fast similarity search
● Add and remove points
● Applications:
– Text clustering
– Detection of compromised computers
● … using distributed processing!

Build from large text datasets

String similarity
But first… how to measure similarity between strings?
Lots of literature:
● Levenshtein
● Damerau
● Jaro-Winkler
● N-Gram
● Q-Gram
● Cosine
● Jaccard index
● …
But no clean implementation!
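To make two of the listed measures concrete, here is a small sketch of the Levenshtein distance and of a Jaccard index computed over character q-grams. It is an illustration written for this transcript, not the author's implementation; the default q = 3 is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 3) -> set:
    """Set of substrings of length q (the whole string if it is shorter than q)."""
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}


def jaccard(a: str, b: str, q: int = 3) -> float:
    """Size of the q-gram intersection divided by the size of the q-gram union."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


distance = levenshtein("kitten", "sitting")
overlap = jaccard("abcd", "bcde")
```

Levenshtein returns a distance (0 means identical), whereas the Jaccard index returns a similarity between 0 and 1.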

Building from text datasets
● NN-Descent
Build an approximate graph
Compute O(n^1.14) similarities
● BUT: iterative!
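The iterative nature of NN-Descent can be seen in the simplified sketch below: start from random neighbors, then repeatedly test the neighbors of each node's neighbors until nothing improves. This is a reduced illustration. The published algorithm also uses reverse neighbors and sampling, which are omitted here, and the parameter names are assumptions.

```python
import random
from typing import Callable, Dict, List, Sequence


def nn_descent(
    items: Sequence,
    k: int,
    similarity: Callable,
    max_iterations: int = 20,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Approximate k-nn graph as {node index: [neighbor indices, best first]}."""
    rng = random.Random(seed)
    n = len(items)
    # neighbors[i] maps a neighbor index to its similarity with node i.
    neighbors = []
    for i in range(n):
        others = rng.sample([j for j in range(n) if j != i], k)
        neighbors.append({j: similarity(items[i], items[j]) for j in others})

    for _ in range(max_iterations):
        updates = 0
        for i in range(n):
            # Candidates are the neighbors of the current neighbors of i.
            candidates = {c for j in neighbors[i] for c in neighbors[j]}
            candidates -= {i} | set(neighbors[i])
            for c in candidates:
                s = similarity(items[i], items[c])
                worst = min(neighbors[i], key=neighbors[i].get)
                if s > neighbors[i][worst]:
                    # Replace the weakest edge by the better candidate.
                    del neighbors[i][worst]
                    neighbors[i][c] = s
                    updates += 1
        if updates == 0:  # converged: a full pass changed nothing
            break
    return {i: sorted(nb, key=nb.get, reverse=True) for i, nb in enumerate(neighbors)}


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1]
graph = nn_descent(points, k=2, similarity=lambda a, b: -abs(a - b))
```

Each pass of the outer loop depends on the result of the previous one, which is what makes the method iterative.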

Building from text datasets
NNCTPH
● Hash using a modified hashing function: CTPH / ssdeep / spamsum
● Build subgraphs in parallel
● Merge subgraphs
Single iteration!
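The three steps above can be sketched as a bucket, build and merge pipeline. Everything specific here is an assumption: `first_letter_bucket` is a stand-in for the modified CTPH hash, `difflib.SequenceMatcher` is a stand-in for the string similarity, and the subgraphs are built sequentially rather than in parallel.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def build_subgraph(bucket: List[str], k: int, similarity: Callable[[str, str], float]) -> Graph:
    """Brute-force k-nn graph restricted to the strings of one bucket."""
    graph = {}
    for a in bucket:
        scored = [(b, similarity(a, b)) for b in bucket if b != a]
        graph[a] = sorted(scored, key=lambda e: e[1], reverse=True)[:k]
    return graph


def merge(subgraphs: Iterable[Graph], k: int) -> Graph:
    """Union the neighbor lists of each node and keep its k best edges."""
    best: Dict[str, Dict[str, float]] = defaultdict(dict)
    for sub in subgraphs:
        for node, edges in sub.items():
            known = best[node]  # also registers nodes that have no edge
            for other, s in edges:
                known[other] = s
    return {
        node: sorted(edges.items(), key=lambda e: e[1], reverse=True)[:k]
        for node, edges in best.items()
    }


def knn_by_buckets(strings, k, similarity, buckets_of) -> Graph:
    buckets = defaultdict(list)
    for s in strings:
        for key in buckets_of(s):
            buckets[key].append(s)
    # Buckets are independent, so each subgraph could be built by a separate task.
    subgraphs = [build_subgraph(b, k, similarity) for b in buckets.values()]
    return merge(subgraphs, k)


def first_letter_bucket(s: str) -> List[str]:
    # Hypothetical stand-in for the hash function: bucket by first letter.
    return [s[0].lower()]


def ratio(a: str, b: str) -> float:
    # Stand-in similarity from the standard library.
    return SequenceMatcher(None, a, b).ratio()


subjects = ["viagra cheap", "viagra best", "watches swiss", "watches clones"]
graph = knn_by_buckets(subjects, k=1, similarity=ratio, buckets_of=first_letter_bucket)
```

Strings are only compared with strings from the same bucket, so the result is approximate: similar strings that land in different buckets are never linked.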

Building from text datasets
● Experimental evaluation:
– Apache Hadoop MapReduce
– SPAM dataset
– Jaro-Winkler string similarity (not a metric)

Fast similarity search
Add and remove points

Online building
● Given a distributed graph:
– Add nodes
– Remove nodes
– Search nearest neighbors of query node
● Requires k-medoids partitioning of the graph

Partitioning
● k-medoids clustering
● CLARANS is slow to converge
● Two faster methods:
– Inspired by Simulated Annealing
– Heuristic
● Impact of partitioning when we perform distributed search
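As a sketch of what a medoid-based partitioning looks like, the function below assigns each item to its most similar medoid. How the medoids are chosen (CLARANS or the two faster methods mentioned above) is not shown, and the example values are assumptions.

```python
from typing import Callable, Dict, List, Sequence


def assign_to_medoids(
    items: Sequence,
    medoids: Sequence,
    similarity: Callable,
) -> Dict[int, List]:
    """Return {medoid index: items whose most similar medoid is that one}."""
    partitions: Dict[int, List] = {m: [] for m in range(len(medoids))}
    for item in items:
        best = max(range(len(medoids)), key=lambda m: similarity(item, medoids[m]))
        partitions[best].append(item)
    return partitions


values = [1, 2, 3, 10, 11, 12]
partitions = assign_to_medoids(values, medoids=[2, 11], similarity=lambda a, b: -abs(a - b))
```

A medoid is always one of the data points, so this only requires a similarity function and works even when the similarity is not a metric.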

Applications

Text clustering
● Text dataset with Jaro-Winkler similarity (not a metric)
● Steps:
– Build (approximate) k-nn graph
– Prune
– Compute connected components
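The last two steps can be sketched as follows, starting from an already built k-nn graph. Pruning by a fixed similarity threshold is an assumption (the slide only says "Prune"), and the small graph is made up for the example.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def prune(graph: Graph, threshold: float) -> Graph:
    """Drop the edges whose similarity is below threshold."""
    return {n: [(m, s) for m, s in edges if s >= threshold] for n, edges in graph.items()}


def connected_components(graph: Graph) -> List[Set[str]]:
    """Connected components of the graph, ignoring edge direction."""
    undirected: Dict[str, Set[str]] = {n: set() for n in graph}
    for n, edges in graph.items():
        for m, _ in edges:
            undirected[n].add(m)
            undirected.setdefault(m, set()).add(n)

    seen: Set[str] = set()
    components = []
    for start in undirected:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from start
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(undirected[node] - component)
        seen |= component
        components.append(component)
    return components


graph = {
    "a": [("b", 0.9), ("c", 0.2)],
    "b": [("a", 0.9), ("c", 0.3)],
    "c": [("d", 0.8), ("a", 0.2)],
    "d": [("c", 0.8), ("b", 0.1)],
}
clusters = connected_components(prune(graph, threshold=0.5))
```

After pruning, only the a–b and c–d edges remain, so the four nodes fall into two clusters.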

APT Detection
● Advanced => no signatures
● Persistent => limited activity
● Threats
● Need a C2 channel

APT Detection
Here:
APT relying on HTTP
=> proxy logs

APT Detection
How hard can that be?

APT Detection
Displaying a page requires multiple HTTP requests
=> link each request to its parent using the logs from the proxy

APT Detection
Weight is higher if:
● Requests are close in time
● Requests belong to the same domain
● Same sequence repeats
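One way to turn these three criteria into a number is sketched below. The formula is entirely hypothetical: the slides do not give the actual weighting, and the `Request` fields and the `repeats` parameter are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    time: float   # seconds since the start of the proxy log
    domain: str


def edge_weight(parent: Request, child: Request, repeats: int = 1) -> float:
    """Hypothetical score that grows with each of the three criteria."""
    time_term = 1.0 / (1.0 + abs(child.time - parent.time))       # close in time
    domain_term = 1.0 if parent.domain == child.domain else 0.0   # same domain
    return (time_term + domain_term) * repeats                    # sequence repeats


page = Request(time=0.0, domain="example.com")
image = Request(time=0.5, domain="example.com")
unrelated = Request(time=30.0, domain="other.net")

strong = edge_weight(page, image)
weak = edge_weight(page, unrelated)
```

With this scoring, a request issued half a second later to the same domain is linked to its parent much more strongly than an unrelated request thirty seconds later.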

APT Detection
After pruning the weighted graph, the APT remains isolated!

APT Detection
● Batch: build graphs
● Interactive (web interface):
– Merge
– Prune
– Cluster
– Filter
● Approximate k-nn graph (time and memory)

APT Detection
● Experimental evaluation
– Proxy logs of a real network
– Simulated APT traffic
– Rank suspicious domains
● Results
– High detection / false alarm ratio
– Without prior knowledge about APT

APT Detection
● False positives:
– Content Delivery Networks (CDN)
– Advertising domains
– JavaScript library delivery
– Websites with very few visits
=> same behavior as APT

Conclusion
The k-nn graph is an interesting tool to analyze large datasets, but
● Only if approximation is acceptable
● Other possibilities exist

Perspectives...
● Broaden to other graph-like structures:
– (Hierarchical) Small World Network graphs
– Asymmetrical graphs
● Broaden to other applications (clustering, nn search)
● Predict the magnitude of approximation

Questions...
Cyber Defence Lab
www.cylab.be

Explainable Deepfake Image/Video Detection
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
 
Nutaceuticsls herbal drug technology CVS, cancer.pptx
Nutaceuticsls herbal drug technology CVS, cancer.pptxNutaceuticsls herbal drug technology CVS, cancer.pptx
Nutaceuticsls herbal drug technology CVS, cancer.pptx
 
Mites,Slug,Snail_Infesting agricultural crops.pdf
Mites,Slug,Snail_Infesting agricultural crops.pdfMites,Slug,Snail_Infesting agricultural crops.pdf
Mites,Slug,Snail_Infesting agricultural crops.pdf
 
acanthocytes_causes_etiology_clinical sognificance-future.pptx
acanthocytes_causes_etiology_clinical sognificance-future.pptxacanthocytes_causes_etiology_clinical sognificance-future.pptx
acanthocytes_causes_etiology_clinical sognificance-future.pptx
 
Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5
 
The Powders And The Granules 123456.pptx
The Powders And The Granules 123456.pptxThe Powders And The Granules 123456.pptx
The Powders And The Granules 123456.pptx
 
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke...
Evaluation and Identification of J'BaFofi the Giant  Spider of Congo and Moke...Evaluation and Identification of J'BaFofi the Giant  Spider of Congo and Moke...
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke...
 
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
Methods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdfMethods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdf
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
 
一比一原版美国佩斯大学毕业证如何办理
一比一原版美国佩斯大学毕业证如何办理一比一原版美国佩斯大学毕业证如何办理
一比一原版美国佩斯大学毕业证如何办理
 

An introduction to similarity search and k-nn graphs

  • 1. Distributed k-nearest neighbors graph algorithms Thibault Debatty, Ir PhD 2019-12-03
  • 2. k-nn graph: each node has an edge to its k most similar nodes
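
To make the definition on slide 2 concrete, here is a minimal sketch of an exact (brute-force) k-nn graph builder. It is not taken from the slides: the function names, the toy numeric dataset and the `closeness` similarity are assumptions chosen for illustration only.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def brute_force_knn_graph(
    points: Sequence[float],
    k: int,
    similarity: Callable[[float, float], float],
) -> Dict[int, List[Tuple[int, float]]]:
    """Return, for each node index, its k most similar other nodes."""
    graph = {}
    for i, p in enumerate(points):
        # Compare node i with every other node: n - 1 similarity computations.
        scored = [(j, similarity(p, q)) for j, q in enumerate(points) if j != i]
        # Sort by decreasing similarity; ties are broken by node index.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        graph[i] = scored[:k]
    return graph


def closeness(a: float, b: float) -> float:
    """Hypothetical similarity for numbers: larger when the values are closer."""
    return 1.0 / (1.0 + abs(a - b))


points = [0.0, 1.0, 1.5, 10.0]
graph = brute_force_knn_graph(points, k=2, similarity=closeness)
neighbor_ids = {node: [j for j, _ in edges] for node, edges in graph.items()}
print(neighbor_ids)
```

This sketch computes n(n-1) similarities, so it is only practical for small datasets.
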
  • 3. Context: Common tasks of machine learning, data mining, Artificial Intelligence or Big Data: ● Similarity search ● Clustering ● Anomaly detection
  • 4. Context: similarity search
  • 5. Context: similarity search. “High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v8g6” Similar to a known SPAM?
  • 6. Context: clustering
      Kobe Bryant traded to Clippers
      No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
      Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
      Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
      Nurses make Great Incomes
      Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
      High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
      Play here for summer fun
      Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
      High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
      Obtain details on your cred1t online. Get started today
      High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
      Japanese food discount
      Is your computer safe?
      Luxury at a Discount!
      Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
      Mutant fish sold at Connecticut market
      High quality JBL speakers
      Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
      High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
      => Identify SPAM campaigns
  • 7. Context: clustering. To analyze 300 rogue websites: ● Cluster ● Analyze 1 representative of each group
  • 8. Context: anomaly detection. Find an infected computer on a network
  • 9. Context: ● Similarity search ● Clustering and ● Anomaly detection … are crucial for data processing!
  • 10. Challenges: How hard can that be?
  • 11. Challenges: Computer memory is similar to a book ● Accessible by address (page) ● You have to read a page before you know its content (e.g. coordinates of a point)
  • 12. Challenges: Naive similarity search requires reading all pages
  • 13. Challenges: How many pages? Bible TOB: ● 2000 pages ● Extra thin paper ● 12cm ● 44 hours of reading
  • 14. Challenges: Samsung Galaxy S9 (4GB): a stack of 63m assuming 4KB/page (Atomium = 102m), 2.6 years of reading...
  • 15. Challenges: Our server ● 1500GB ● 200,000 books ● A stack of 24km ● 1000 years of reading (Brussels – Louvain-la-Neuve = 26km)
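
The figures on slides 13 to 15 can be reproduced with a few lines of arithmetic. The sketch below is not part of the talk and rests on assumptions: binary units (1 GB = 1024^3 bytes, 4 KB = 4096 bytes per page) and non-stop reading, 24 hours a day, at the pace of the 2000-page, 12 cm, 44-hour reference book.

```python
PAGE_BYTES = 4 * 1024        # one memory "page" (assumed 4 KiB)
BOOK_PAGES = 2000            # pages in the reference book (slide 13)
BOOK_THICKNESS_M = 0.12      # thickness of the reference book, in metres
BOOK_READING_HOURS = 44      # time to read the reference book
HOURS_PER_YEAR = 24 * 365    # assumption: reading around the clock


def memory_as_books(memory_bytes: int) -> dict:
    """Express a memory size as books, stack height and reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "reading_years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


phone = memory_as_books(4 * 1024**3)      # 4 GB smartphone
server = memory_as_books(1500 * 1024**3)  # 1500 GB server

print(f"phone:  {phone['stack_m']:.0f} m, {phone['reading_years']:.1f} years")
print(f"server: {server['books']:.0f} books, "
      f"{server['stack_m'] / 1000:.0f} km, {server['reading_years']:.0f} years")
```

Under these assumptions the phone comes out at about 63 m and 2.6 years, and the server at roughly 197,000 books, 24 km and just under 1000 years, which matches the rounded values on the slides.
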
  • 16. Challenges: Even with modern hardware, naive algorithms are not an option
  • 17. Indexes: Divide space into “zones”. Example: ● North: pages 1, 2, 3 and 4 ● South: pages 5, 6 and 7
  • 18. Indexes: Similarity search with an index: “query” is near zone “SOUTH” => read pages 5, 6 and 7
  • 19. Indexes: limitations. Similarity search with an index requires reading multiple zones: 1d: 2 zones, 2d: 4 zones, 3d: 8 zones, 8d: 256 zones (2^d zones in d dimensions): the “curse of dimensionality”
  • 20. Indexes: limitations. Great for low dimensional Euclidean datasets (time). But what about ● Higher dimensions? TV commercials: 4125 dimensions ● Text?
  • 21. k-nn graph: Can we use a k-nn graph for analyzing large datasets?
  • 22. k-nn graph: Existing algorithms: ● Clustering ● Similarity search (but slow)
  • 23. Outline: ● Build from large text datasets ● Fast similarity search ● Add and remove points ● Applications: – Text clustering – Detection of compromised computers ● … using distributed processing!
  • 24. Build from large text datasets
  • 25. String similarity: But first… how to measure similarity between strings? Lots of literature: ● Levenshtein ● Damerau ● Jaro-Winkler ● N-Gram ● Q-Gram ● Cosine ● Jaccard index ● … But no clean implementation!
  • 26. String similarity
  • 27. String similarity
  • 28. String similarity
  • 29. String similarity
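
As an illustration of two of the measures listed on slide 25, the following sketch implements the Levenshtein edit distance and a Jaccard index over character n-grams. It is a textbook-style example, not the implementation referred to in the talk; the function names and the choice of bigrams (n = 2) are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_index(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the n-gram union."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # convention chosen here for two strings without n-grams
    return len(ga & gb) / len(ga | gb)


print(levenshtein("Quality", "Qua1ity"))   # one substituted character
print(jaccard_index("night", "nacht"))
```

The first call shows why edit distance suits obfuscated spam such as “Qua1ity”: a single substitution keeps the distance at 1.
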
  • 30. Building from text datasets: ● NN-Descent: build an approximate graph, compute O(n^1.14) similarities ● BUT: iterative!
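
The iterative nature of NN-Descent mentioned on slide 30 can be seen in the simplified sketch below. It keeps only the core idea (a neighbor of a neighbor is likely to be a neighbor) and leaves out the sampling and incremental-search optimizations of the published algorithm, so it computes more similarities than the figure quoted on the slide. All names, the toy messages and the bigram similarity are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence


def bigram_similarity(a: str, b: str) -> float:
    """Hypothetical similarity: Jaccard index over character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def nn_descent(
    items: Sequence[str],
    k: int,
    similarity: Callable[[str, str], float],
    max_iterations: int = 10,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Approximate k-nn graph (node index -> neighbor indices); requires k < len(items)."""
    rng = random.Random(seed)
    n = len(items)
    # Start from k random neighbors per node.
    graph = {i: rng.sample([j for j in range(n) if j != i], k) for i in range(n)}
    for _ in range(max_iterations):
        # Reverse edges: which nodes currently point at each node.
        reverse = {i: set() for i in range(n)}
        for i, neighbors in graph.items():
            for j in neighbors:
                reverse[j].add(i)
        new_graph, updates = {}, 0
        for i in range(n):
            local = set(graph[i]) | reverse[i]
            # Candidates: the local neighborhood plus neighbors of those nodes.
            candidates = set(local)
            for j in local:
                candidates.update(graph[j])
                candidates.update(reverse[j])
            candidates.discard(i)
            ranked = sorted(
                candidates, key=lambda j: (-similarity(items[i], items[j]), j)
            )
            new_graph[i] = ranked[:k]
            if set(new_graph[i]) != set(graph[i]):
                updates += 1
        graph = new_graph
        if updates == 0:  # no neighbor list changed: stop iterating
            break
    return graph


messages = [
    "cheap pills discount", "cheap pills best deal", "best deal on pills",
    "replica watches cheap", "swiss replica watches", "buy replica watch",
    "japanese food discount", "summer fun play here",
]
result = nn_descent(messages, k=2, similarity=bigram_similarity)
print(result)
```

Each pass depends on the graph produced by the previous one, which is what makes the method awkward to run as a single distributed job.
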
  • 31. Building from text datasets: NNCTPH ● Hash using modified hashing function CTPH / ssdeep / spamsum ● Build subgraphs in parallel ● Merge subgraphs. Single iteration!
  • 32. Building from text datasets
  • 33. Building from text datasets: ● Experimental evaluation: – Apache Hadoop MapReduce – SPAM dataset – Jaro-Winkler string similarity (not a metric)
  • 34. Building from text datasets
  • 35. Fast similarity search. Add and remove points
  • 36. Online building: ● Given a distributed graph: – Add nodes – Remove nodes – Search nearest neighbors of query node ● Requires k-medoids partitioning of graph
  • 37. Partitioning: ● k-medoids clustering ● CLARANS is slow to converge ● Two faster methods: – Inspired by Simulated Annealing – Heuristic ● Impact of partitioning when we perform distributed search
  • 38. Applications
  • 39. Text clustering: ● Text dataset with Jaro-Winkler similarity (not a metric) ● Steps: – Build (approximate) k-nn graph – Prune – Compute connected components
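
The last two steps of slide 39 (prune, then compute connected components) can be sketched as follows. The slides do not say how pruning is done, so this example assumes it means dropping edges whose similarity is below a threshold; the threshold value, the toy graph and the function names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, similarity)]


def prune(graph: Graph, threshold: float) -> Graph:
    """Keep only the edges whose similarity is at least the threshold."""
    return {
        node: [(nb, s) for nb, s in edges if s >= threshold]
        for node, edges in graph.items()
    }


def connected_components(graph: Graph) -> List[Set[str]]:
    """Group nodes that are linked by any path, treating edges as undirected."""
    adjacency: Dict[str, Set[str]] = {node: set() for node in graph}
    for node, edges in graph.items():
        for nb, _ in edges:
            adjacency[node].add(nb)
            adjacency.setdefault(nb, set()).add(node)
    seen: Set[str] = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node collects one component.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


knn = {
    "spam A v1": [("spam A v2", 0.95), ("news", 0.30)],
    "spam A v2": [("spam A v1", 0.95), ("spam B", 0.40)],
    "spam B": [("spam A v2", 0.40), ("news", 0.20)],
    "news": [("spam B", 0.20), ("spam A v1", 0.30)],
}
clusters = connected_components(prune(knn, threshold=0.8))
print(clusters)
```

With this toy graph the two variants of “spam A” end up in one cluster while the other two messages stay isolated. Because only similarities are compared against a threshold, the approach does not need the similarity to be a metric.
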
  • 40. APT Detection: ● Advanced => no signatures ● Persistent => limited activity ● Threats ● Need a C2 channel
  • 41. APT Detection
  • 42. APT Detection: Here: APT relying on HTTP => proxy logs
  • 43. APT Detection: How hard can that be?
  • 44. APT Detection
  • 45. APT Detection: Displaying a page requires multiple HTTP requests => link each request to its parent using the logs from the proxy
  • 46. APT Detection
  • 47. APT Detection
  • 48. APT Detection: Weight is higher if: ● Requests are close in time ● Requests belong to the same domain ● Same sequence repeats
  • 49. APT Detection: After pruning the weighted graph, the APT remains isolated!
  • 50. APT Detection: Weight is higher if: ● Requests are close in time ● Requests belong to the same domain ● Same sequence repeats
  • 51. APT Detection: ● Batch: build graphs ● Interactive (web interface): – Merge – Prune – Cluster – Filter ● Approximate k-nn graph (time and memory)
  • 52. APT Detection
  • 53. APT Detection: ● Experimental evaluation: – Proxy logs of real network – Simulated APT traffic – Rank suspicious domains ● Results: – High detection / false alarm ratio – Without prior knowledge about APT
  • 54. APT Detection: ● False positives: – Content Delivery Networks (CDN) – Advertising domains – Javascript library delivery – Websites with very few visits => same behavior as APT
  • 55. Conclusion: A k-nn graph is an interesting tool to analyze large datasets, but ● Only if approximation is acceptable ● Other possibilities exist
  • 56. Perspectives... ● Broaden to other graph-like structures: – (Hierarchical) Small World Network graphs – Asymmetrical graphs ● Broaden to other applications (clustering, nn search) ● Predict the magnitude of approximation
  • 57. Questions... Cyber Defence Lab www.cylab.be