From ensembles to computer networks
Sevvandi Kandanaarachchi
Data61 MLRG Seminar
21st April 2022
1
Overview
• Why is finding interesting patterns in data important?
• Methodology: Item Response Theory to construct an anomaly
detection ensemble
• An application: computer networks
• Next challenges
2
Interesting patterns in data
3
Interesting patterns in data – Why?
• We live in a data-rich world
• Phones and personal smart devices
• Videos/CCTV
• Satellites roaming around the planet
• Social media and content generation
• Wearable technology (heart rate monitor)
4
What should we focus upon?
• Impossible to go through all the data in real time
• But we want to know when something “important”
happens
• Important – context dependent
• A person who is monitored has had a fall (wearables)
• Deforestation (satellite data)
• A group harmful to society is gaining popularity (social
media – national security)
• A bushfire starting off
5
Challenges
• Automated tools to extract these events of interest
• Early detection is super important
• High accuracy
• Low false positive rates
• Complex, noisy data
Goals
• To allocate resources effectively and efficiently
• Prevent disaster from happening
• Or minimize the loss
6
A critical piece – finding the interesting bit
• It can be called many names
• Events, anomalies, outliers, novelties, emerging threats
• Can’t always train a model to find the interesting bit
• Can’t lock in what is interesting
• Training a model on certain fraud/intrusions/cyber attacks is not optimal, because there are
new types of fraud/attacks, always!
• Antivirus – known viruses only
• You want something more “intelligent” and accurate
• Alerts you when something weird happens with high accuracy
• Flexible (can evolve)
• A shift of focus over time
• Previously outliers were detected to be discarded – they make the model worse
• Now, we want to know about the anomalies – they are telling us something interesting
7
We looked at interesting patterns in data.
Next, we look at some specific research.
8
An anomaly detection ensemble using Item Response Theory
Unsupervised Anomaly Detection Ensembles using Item Response Theory
Sevvandi Kandanaarachchi
Information Sciences (2022)
9
What are we trying to do?
Achieve higher accuracy
• New methods with better accuracy
• Build an ensemble from existing methods
10
Specific challenges
• In regression we have $(x, y) \to (x, \hat{y})$
• So you can use the error $e = y - \hat{y}$ in your ensemble
• The models can be weighted by their accuracy
But…
• Unsupervised anomaly detection does not have $y$
• We have $x$ → each AD method gives $y_1, y_2, y_3, y_4$ → the ensemble gives $y_{ens}$
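To make the contrast concrete, here is a minimal sketch of the supervised case, where the true $y$ is available and each model can be weighted by its accuracy. The inverse-MSE weighting and the toy numbers are assumptions chosen for illustration, not something taken from the talk. In the unsupervised setting the `y_true` argument does not exist, so this weighting cannot be computed.

```python
def inverse_error_weights(y_true, predictions):
    """Weight each model by the inverse of its mean squared error (illustrative choice)."""
    inv_mse = []
    for y_hat in predictions:
        mse = sum((y - p) ** 2 for y, p in zip(y_true, y_hat)) / len(y_true)
        inv_mse.append(1.0 / (mse + 1e-12))  # small constant avoids division by zero
    total = sum(inv_mse)
    return [w / total for w in inv_mse]


def weighted_ensemble(predictions, weights):
    """Combine model predictions observation by observation."""
    n_obs = len(predictions[0])
    return [sum(w * pred[i] for w, pred in zip(weights, predictions))
            for i in range(n_obs)]


# Toy data (hypothetical): one accurate model and one noisy model.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = [
    [1.1, 2.1, 2.9, 4.1],
    [2.0, 1.0, 4.0, 3.0],
]
weights = inverse_error_weights(y_true, predictions)
y_ens = weighted_ensemble(predictions, weights)
```

The accurate model receives almost all of the weight here, which is exactly the information an unsupervised ensemble has to recover some other way.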
12
What is an anomaly detection ensemble?
Dataset (the data $x$) → Unsupervised AD methods (the anomaly scores $y_1, y_2, \dots, y_7$) → AD ensemble → Ensemble score
The AD methods are heterogeneous methods
13
We use Item Response Theory to construct
the ensemble
• Explain IRT
• How we use it to construct an AD ensemble
14
What is Item Response Theory (IRT)?
• A set of models used in educational psychometrics/social sciences
• Premise - intrinsic “quality” that cannot be measured directly
• Racial prejudice or stress proneness
• Political inclinations
• Verbal or mathematical ability
• A test instrument
• A survey
• Exam
15
IRT
Input: survey responses or exam marks → IRT model
Output:
• Discrimination of each test item
• Difficulty of each test item
• Participant ability (hidden quality)
16
IRT in education
• $N$ students answer $n$ questions
• Your input to the IRT model is a matrix of marks $Y_{N \times n}$
• Fit the IRT model
• You get as your output
• Test item discrimination
• Test item difficulty
• Student ability (latent trait)
• Focus is on item discrimination and
difficulty
        Q 1    Q 2    Q 3    Q 4
Stu 1   0.95   0.87   0.67   0.84
Stu 2   0.57   0.49   0.78   0.77
...
Stu N   0.75   0.86   0.57   0.45
17
IRT in psychometrics
• A survey
• Rosenberg's Self-Esteem Scale
• I feel I am a person of worth (Strongly Agree/Agree/Neutral/... )
• Use original responses (no marking as in education)
• Fit the IRT model
• Output
• Participant's self-esteem (hidden quality = latent trait)
• Question discrimination
• Question difficulty
• Focus is on the hidden ability
18
IRT in Data Science/Machine Learning
• Relatively new area of research
• From performance data find
• Ability of classifiers
• Discrimination/difficulty of datasets
• 2019 - Item response theory in AI: Analysing machine learning
classifiers at the instance level – F. Martínez-Plumed et al.
19
IRT ensemble for anomaly detection
Matrix of anomaly scores $Y_{N \times n}$ → IRT model → latent trait
Latent trait = the anomalousness of the observations = the ensemble score
High values → high anomalousness, low values → low anomalousness
20
Example
• AD methods (DDoutlier, h2o, e1071)
• KNN_AGG
• LOF
• COF
• INFLO
• KDEOS
• LDF
• LDOF
• Autoencoders – Deep learning
• OCSVM – One class Support Vector
Machine
• Isolation Forest – Tree based method
Slide annotations on the method list: nearest neighbourhood-based methods; density/distance-based methods
Dataset → Unsupervised AD methods → AD ensemble → Ensemble score
21
Unsupervised AD methods output
$Y_{N \times n}$ = [matrix of anomaly scores: one row per observation, one column per AD method]
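As a rough illustration of how such a score matrix can be assembled, the sketch below runs two toy detectors over the same data and stacks their scores column by column. The two detectors (k-th nearest neighbour distance and distance to the centroid) are simple stand-ins assumed for this example; they are not the DDoutlier, h2o or e1071 methods listed in the talk, and all function names are hypothetical.

```python
import math


def knn_distance_scores(points, k=2):
    """Score = distance to the k-th nearest neighbour (larger means more anomalous)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def centroid_distance_scores(points):
    """Score = distance to the centroid of the data."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, centroid) for p in points]


def score_matrix(points, detectors):
    """Return Y as N rows (observations), each holding n detector scores."""
    columns = [detector(points) for detector in detectors]
    return [list(row) for row in zip(*columns)]


# Toy data (hypothetical): four clustered points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
Y = score_matrix(points, [knn_distance_scores, centroid_distance_scores])
```

Each row of `Y` corresponds to one observation and each column to one detector, matching the $N \times n$ layout used as input to the IRT model.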
22
IRT Ensemble
$Y_{N \times n}$ → IRT ensemble → Ensemble score
23
Fitting the IRT model
• Maximising the expectation
$$E = N \sum_j \left(\ln \alpha_j + \ln |\gamma_j|\right) - \frac{1}{2} \sum_i \sum_j \alpha_j^2 \left(\beta_j + \gamma_j z_{ij} - \mu_i^{(t)}\right)^2 + \dots$$
Why does it work?
• Ensemble scores
$$\theta_i = \frac{\sum_j \alpha_j^2 (\beta_j + \gamma_j z_{ij})}{\sum_j \alpha_j^2}$$
$\theta_i$ - ensemble score for the $i$th observation
$\alpha_j$ - discrimination
$\gamma_j$ - scaling parameters for the $j$th AD method
$\beta_j$ - difficulty
$z_{ij}$ - anomaly score of the $j$th AD method on the $i$th observation
25
Why does it work?
$$\theta_i = \frac{\sum_j \alpha_j^2 (\beta_j + \gamma_j z_{ij})}{\sum_j \alpha_j^2} = \sum_j (c_j + w_j z_{ij})$$
• Ensemble scores are a weighted average of the original anomaly
scores
• The weights 𝑤𝑗 depend on the discrimination and scaling parameters
of each anomaly detection method
• AD Methods with higher discrimination get higher weights
• Ensemble accentuates better methods and downplays noisy methods
Each AD method has a weight
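The ensemble-score formula can be evaluated directly once the item parameters are known. The sketch below is a plain transcription of that formula, with $w_j = \alpha_j^2 \gamma_j / \sum_k \alpha_k^2$ read off from it. The parameter values are hypothetical and hand-picked for illustration; in practice they would come from fitting the IRT model, which this sketch does not do.

```python
def irt_ensemble_scores(Z, alpha, beta, gamma):
    """theta_i = sum_j alpha_j^2 (beta_j + gamma_j z_ij) / sum_j alpha_j^2."""
    denom = sum(a * a for a in alpha)
    scores = []
    for row in Z:
        numer = sum(a * a * (b + g * z)
                    for a, b, g, z in zip(alpha, beta, gamma, row))
        scores.append(numer / denom)
    return scores


def implied_weights(alpha, gamma):
    """w_j = alpha_j^2 gamma_j / sum_k alpha_k^2, the multiplier on z_ij."""
    denom = sum(a * a for a in alpha)
    return [a * a * g / denom for a, g in zip(alpha, gamma)]


# Hypothetical parameters: method 0 is highly discriminating, method 1 is not.
alpha = [2.0, 0.5]
beta = [0.0, 0.0]
gamma = [1.0, 1.0]

# Two observations scored by two methods that disagree with each other.
Z = [
    [0.1, 0.9],
    [0.9, 0.1],
]
theta = irt_ensemble_scores(Z, alpha, beta, gamma)
weights = implied_weights(alpha, gamma)
```

Because the weight grows with the square of the discrimination, the ensemble here follows method 0 and ranks the second observation as more anomalous, even though method 1 says the opposite.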
26
This work
• R package – outlierensembles – on CRAN
• Extends R package EstCRM for IRT
• Includes other anomaly detection ensembles as well
• More details in the paper: https://arxiv.org/abs/2106.06243
27
We looked at an AD ensemble.
Next, we dive into an application.
28
An application in computer network security
Honeyboost: Boosting honeypot performance with data fusion and anomaly detection
Sevvandi Kandanaarachchi, Hideya Ochiai (UTokyo), Asha Rao (RMIT)
Expert Systems with Applications (2022)
29
LAN Security Monitoring Project
• Between 12 ASEAN and
SAARC countries
• Boost cyber-resilience
among partners
• Countries in low
economic conditions
• Cost effective methods
• Focus on Local Area
Networks (LAN)
Figure: Average Monthly Malware Encounter Rate, 2018 (Microsoft, Security Intelligence Report, 2019)
Nodes by country:
• Japan: about 10 nodes
• Malaysia: 3 nodes
• Laos: 1 node
• Thailand: 6 nodes
• Myanmar: 2 nodes
• Indonesia: 4 nodes
• Cambodia: 2 nodes
• India: 2 nodes
• Philippines: 2 nodes
• Vietnam: 4 nodes
30
LAN: Local Area Network
LAN-Security Monitoring
Device (honeypot)
Smartphones
Printer
Smart Appliances
Data Server
Inside a Local Area Network (LAN)
• Devices communicating with each other
• Any suspicious behaviour?
• Detect malware in action
31
The Data
• Several protocol features
• Features derived by
looking at packet headers
• Features specific to the
protocol
• Each protocol has a
different number of
features
Protocol 1
Timestamp    From_Node      F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11
1553585825   172.16.1.107   80  2   64  0   2   0   0   0   0   1    1
1553585890   172.16.1.107   80  2   64  0   2   0   0   0   0   1    1
1553660565   172.16.1.107   80  1   64  0   1   0   0   0   0   1    1
1553660570   172.16.1.107   80  1   64  0   1   0   0   0   0   0    0
1553667575   172.16.1.107   80  3   64  0   3   0   0   0   0   2    2
1553667580   172.16.1.107   80  1   64  0   1   0   0   0   0   0    0
1553751195   172.16.1.208   80  1   64  0   1   0   0   0   0   0    0

Protocol 2
Timestamp    From_Node     G1     G2  G3
1554351595   172.16.1.86   3702   2   652
1554351595   172.16.1.86   137    2   78
1554351595   172.16.1.86   1900   4   146
1554351595   172.16.1.86   7      1   28
32
Varying-dimensional time series
• Sort by time, then by node
• Different protocols have different features
• Finding anomalies from varying-dimensional time series
• 400 computers/nodes = 400 varying-dimensional time series
• Which ones are anomalous?
33
The methodology
Varying-dimensional time series for each node → multivariate time series → Compute features → AD method (lookout)
• Using a window model
• We know the real anomalous nodes and the times (they access something they shouldn't: the honeypot)
34
Varying-dimensional time series for each node → multivariate time series
Node A
Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
30         ARP       10         12          0        0        0        0
55         TCP       0          0           -2.15    1.75     0        0
85         UDP       0          0           0        0        3.56     0.45
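One way to read the Node A table is that every record is placed on a common set of columns, with zeros in the columns belonging to other protocols. The sketch below reproduces that layout for the three Node A rows. The zero-fill rule is inferred from the table, and the function name and record format are hypothetical.

```python
def to_wide_rows(records, columns):
    """Place per-protocol records on a shared set of columns, zero-filling the rest.

    records: list of (timestamp, protocol, {feature_name: value}) tuples.
    """
    rows = []
    for timestamp, protocol, features in sorted(records, key=lambda r: r[0]):
        row = {"timestamp": timestamp, "protocol": protocol}
        for col in columns:
            row[col] = features.get(col, 0.0)  # assumed: absent features become 0
        rows.append(row)
    return rows


COLUMNS = ["ARP count", "ARP degree", "TCP PC1", "TCP PC2", "UDP PC1", "UDP PC2"]

# The three Node A records from the slide, deliberately given out of time order.
node_a = [
    (55, "TCP", {"TCP PC1": -2.15, "TCP PC2": 1.75}),
    (30, "ARP", {"ARP count": 10, "ARP degree": 12}),
    (85, "UDP", {"UDP PC1": 3.56, "UDP PC2": 0.45}),
]
wide = to_wide_rows(node_a, COLUMNS)
```

After this step each node has an ordinary multivariate time series with a fixed number of columns, regardless of which protocols it used.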
35
multivariate time series → Compute features → Feature space for all nodes
Node A
Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
30         ARP       10         12          0        0        0        0
55         TCP       0          0           -2.15    1.75     0        0
85         UDP       0          0           0        0        3.56     0.45
MV time series for each node gets transformed to a point in $\mathbb{R}^{17}$
36
Features
• The total length of line segments in $\mathbb{R}^6$
• The maximum time difference
• Number of protocols used
• Number of TCP calls/UDP calls
• Total length of line segments in each protocol space
• Line of best fit in each protocol space
• Sum of squared errors for the line of best fit
[Figure: axes TCP PC1 and TCP PC2]
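A few of these features are easy to sketch. The code below computes the total length of the line segments joining consecutive points, the maximum time difference, and an ordinary least squares line of best fit with its sum of squared errors. Reading "line of best fit" as ordinary least squares in a two-dimensional protocol space is an assumption, and the function names and toy inputs are hypothetical.

```python
import math


def total_segment_length(points):
    """Sum of Euclidean distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def max_time_gap(timestamps):
    """Largest difference between consecutive timestamps (0 if fewer than two)."""
    ts = sorted(timestamps)
    return max((b - a for a, b in zip(ts, ts[1:])), default=0)


def best_fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (slope, intercept, sum of squared errors).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


# Toy inputs (hypothetical): three points in a 2-D protocol space and their times.
tcp_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
timestamps = [30, 55, 85]

length = total_segment_length(tcp_points)
gap = max_time_gap(timestamps)
slope, intercept, sse = best_fit_line([p[0] for p in tcp_points],
                                      [p[1] for p in tcp_points])
```

Collecting such quantities per node gives one fixed-length feature vector per node, which is what allows every node to be compared in a single feature space.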
37
AD method - lookout
• lookout - work with Rob Hyndman. Published in JCGS (2021)
• Uses Extreme Value Theory (EVT) to find anomalies
• Applicability: Computer network traffic has heavy tails – EVT can
handle that
Feature space → AD method (lookout)
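For readers new to Extreme Value Theory, the sketch below shows a generic peaks-over-threshold calculation: take the values above a high threshold, fit a generalized Pareto distribution (GPD) to the excesses, and ask how unlikely a given excess is. This is not the lookout algorithm, whose details are in the JCGS paper. The method-of-moments fit, the 90% threshold and the toy scores are all assumptions made for illustration.

```python
import math
import statistics


def fit_gpd_moments(exceedances):
    """Method-of-moments estimates (shape, scale) for a generalized Pareto distribution."""
    mean = statistics.fmean(exceedances)
    var = statistics.variance(exceedances)
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    return shape, scale


def gpd_survival(excess, shape, scale):
    """Probability that an excess over the threshold is larger than `excess`."""
    if abs(shape) < 1e-9:  # the shape -> 0 limit is the exponential distribution
        return math.exp(-excess / scale)
    base = 1.0 + shape * excess / scale
    return 0.0 if base <= 0.0 else base ** (-1.0 / shape)


# Toy scores (hypothetical): 100 ordinary values and one extreme value.
scores = [0.1 * i for i in range(1, 101)] + [50.0]
threshold = sorted(scores)[int(0.9 * len(scores))]
exceedances = [s - threshold for s in scores if s > threshold]

shape, scale = fit_gpd_moments(exceedances)
p_typical = gpd_survival(0.5, shape, scale)
p_extreme = gpd_survival(50.0 - threshold, shape, scale)
```

A positive fitted shape corresponds to a heavy tail, which is the situation the talk describes for computer network traffic. The extreme value still ends up with a much smaller tail probability than a typical excess.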
38
Results
• We identify real anomalies
before they access the
honeypot (they shouldn’t do
that)
• The nodes behave in an
anomalous way before a
“breach” is triggered
• We can predict a breach using
this method
• Low false positives
• Visualize how anomalies develop
• Discover patterns of suspicious
behaviour
39
Thoughts ...
• This was a classic data science problem
• We were given the data and the problem context and asked to tackle
it
• How do you formulate the problem?
• Many building blocks
• Identifying anomalies and visualising them aids decision making
• Bonus: open up a research avenue
• Underlying general research problem, not application-specific
40
Another way to think about this problem
• Model the network dynamics
• Find suspicious behaviour in a
network
• Network dynamics not commonly
used in cyber security
• Public datasets do not facilitate that
• Growth potential in this area
• Tom Bernardi’s MSc project
41
What challenges lie
ahead?
42
Next challenges for the field
• Networks
• Anomalies/events in networks (computer networks)
• Nuwan’s MSc Project on behavioural biometrics
• Visualization of networks at different granularities
• Dynamic networks – echo chambers – how they form
• Event detection in spatiotemporal data
• Applications in epidemiology
• Can you identify a hotspot before it happens?
• Ecology
• Algorithm bias
• Bias in data + bias in algorithm
43
Recap: ensembles to networks
• Broad applicability in detecting
interesting patterns in data
• Applications in cyber security, wearable
sensors, satellite data, social media
• Core research problem ties back to
statistics/maths
• Need robust, highly accurate
methodologies that can capture these
patterns
• Exciting field. Thrilled to be part of it!
44
https://sevvandi.netlify.app/
@sevvandik
45
Extra Slides
46
IRT Ensemble related
47
Continuous IRT model
• Samejima, 1969 – Continuous Response Model
• Wang and Zeng, 1998 – Procedure to compute item parameters using
expectation maximization for Samejima’s model
• Shojima, 2005 – Non-iterative item parameter solution in each EM
cycle
• Zopluoglu, 2015 – EstCRM R package implements Shojima’s 2005
model
• Update the loglikelihood – to include negative discrimination items
48
Comparison
49
Example with
iterations
Data in $\mathbb{R}^6$: first 2 dimensions shown, others normally distributed
Evaluation metric: area under the ROC curve
[Figure panels: Iteration 5, Iteration 10]
50
Comparison on a data repository
51
Computer Networks related
52
LAN Security Monitoring
• ‘LAN-Security Monitoring Device’ to capture suspicious/malicious activities that happen inside a LAN.
LAN: Local Area Network
LAN-Security Monitoring Device
Honeypot - a trap for attackers
Smartphones
Printer
Smart Appliances
Data Server
53
Two approaches
54
Findings
• Suspicious nodes that do not
access the honeypot
Feature space for all nodes → lookout
[Figure callouts: two flagged nodes, each labelled "This node does not access the honeypot"]
55
Low false positives
56

Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 

From ensembles to computer networks

  • 1. From ensembles to computer networks Sevvandi Kandanaarachchi Data61 MLRG Seminar 21st April 2022 1
  • 2. Overview • Why is finding interesting patterns in data important? • Methodology: Item Response Theory to construct an anomaly detection ensemble • An application: computer networks • Next challenges 2
  • 4. Interesting patterns in data – Why? • We live in a data-rich world • Phones and personal smart devices • Videos/CCTV • Satellites roaming around the planet • Social media and content generation • Wearable technology (heart rate monitor)
  • 5. What should we focus upon? • Impossible to go through all the data in real time • But we want to know when something “important” happens • Important – context dependent • A person who is monitored has had a fall (wearables) • Deforestation (satellite data) • A group harmful to society is gaining popularity (social media – national security) • A bushfire starting off
  • 6. Challenges • Automated tools to extract these events of interest • Early detection is super important • High accuracy • Low false positive rates • Complex, noisy data Goals • To allocate resources effectively and efficiently • Prevent disaster from happening • Or minimize the loss 6
  • 7. A critical piece – finding the interesting bit • It can be called many names • Events, anomalies, outliers, novelties, emerging threats • Can’t always train a model to find the interesting bit • Can’t lock in what is interesting • Training a model on certain fraud/intrusions/cyber attacks is not optimal, because there are new types of fraud/attacks, always! • Antivirus – known viruses only • You want something more “intelligent” and accurate • Alerts you when something weird happens with high accuracy • Flexible (can evolve) • A shift of focus over time • Previously outliers were detected to be discarded – they make the model worse • Now, we want to know about the anomalies – they are telling us something interesting 7
  • 8. We looked at interesting patterns in data. Next, we look at some specific research. 8
  • 9. An anomaly detection ensemble using Item Response Theory Unsupervised Anomaly Detection Ensembles using Item Response Theory Sevvandi Kandanaarachchi Information Sciences (2022) 9
  • 10. What are we trying to do? • Achieve higher accuracy, either through new methods with better accuracy or by building an ensemble from existing methods
  • 12. Specific challenges • In regression we have $(x, y) \to (x, \hat{y})$ • So you can use the error $e = y - \hat{y}$ in your ensemble • The models can be weighted by their accuracy But… • Unsupervised anomaly detection does not have $y$ • We have $x$ → each AD method gives $y_1, y_2, y_3, y_4$ → the ensemble gives $y_{ens}$
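The supervised case on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the model names, the toy numbers, and the choice of inverse mean-squared-error weights are assumptions, not something the talk specifies. It shows why labels matter: the weights are computed from `y_true`, which unsupervised anomaly detection does not have.

```python
# Toy regression setting: true targets exist, so each model's error is measurable.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {  # hypothetical model outputs
    "model_a": [1.1, 2.1, 2.9, 4.2],
    "model_b": [0.5, 2.5, 3.5, 3.0],
}


def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


# Weight each model by the inverse of its error (an assumed weighting scheme).
errors = {name: mse(y_true, p) for name, p in predictions.items()}
inverse = {name: 1.0 / e for name, e in errors.items()}
total = sum(inverse.values())
weights = {name: v / total for name, v in inverse.items()}

# Weighted ensemble prediction for each observation.
ensemble = [
    sum(weights[m] * predictions[m][i] for m in predictions)
    for i in range(len(y_true))
]
```

Without `y_true`, the `errors` dictionary cannot be computed, so the weights have to come from somewhere else; the IRT model later in the talk plays that role.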
  • 13. What is an anomaly detection ensemble? • Dataset (the data $x$) → unsupervised AD methods, which are heterogeneous (the anomaly scores $y_1, y_2, y_3, y_4, y_5, y_6, y_7$) → AD ensemble → ensemble score
  • 14. We use Item Response Theory to construct the ensemble • Explain IRT • How we use it to construct an AD ensemble
  • 15. What is Item Response Theory (IRT)? • A set of models used in educational psychometrics/social sciences • Premise - intrinsic “quality” that cannot be measured directly • Racial prejudice or stress proneness • Political inclinations • Verbal or mathematical ability • A test instrument • A survey • Exam
  • 16. IRT • Input: survey responses or exam marks → IRT model • Output: discrimination of each test item, difficulty of each test item, participant ability (hidden quality)
  • 17. IRT in education • $N$ students answer $n$ questions • Your input to the IRT model is a matrix of marks $Y_{N \times n}$ • Fit the IRT model • You get as your output • Test item discrimination • Test item difficulty • Student ability (latent trait) • Focus is on item discrimination and difficulty

            Q 1    Q 2    Q 3    Q 4
    Stu 1   0.95   0.87   0.67   0.84
    Stu 2   0.57   0.49   0.78   0.77
    Stu N   0.75   0.86   0.57   0.45

  • 18. IRT in psychometrics • A survey • Rosenberg's Self-Esteem Scale • I feel I am a person of worth (Strongly Agree/Agree/Neutral/... ) • Use original responses (no marking as in education) • Fit the IRT model • Output • Participants' self-esteem (hidden quality = latent trait) • Question discrimination • Question difficulty • Focus is on the hidden ability
  • 19. IRT in Data Science/Machine Learning • Relatively new area of research • From performance data find • Ability of classifiers • Discrimination/difficulty of datasets • 2019 - Item response theory in AI: Analysing machine learning classifiers at the instance level – F. Martínez-Plumed et al. 19
  • 20. IRT ensemble for anomaly detection • Matrix of anomaly scores $Y_{N \times n}$ → IRT model • Latent trait = the anomalousness of the observations = the ensemble score • High values → high anomalousness, low values → low anomalousness
  • 21. Example • AD methods (DDoutlier, h2o, e1071) • KNN_AGG • LOF • COF • INFLO • KDEOS • LDF • LDOF (the methods above are nearest neighbourhood-based or density/distance-based) • Autoencoders – deep learning • OCSVM – One-class Support Vector Machine • Isolation Forest – tree-based method
  • 22. Unsupervised AD methods output: the matrix of anomaly scores $Y_{N \times n}$
  • 24. Fitting the IRT model • Maximising the expectation • $E = N \sum_j (\ln \alpha_j + \ln |\gamma_j|) - \frac{1}{2} \sum_i \sum_j \alpha_j^2 (\beta_j + \gamma_j z_{ij} - \mu_i^{(t)})^2 + \dots$
  • 25. Why does it work? • Ensemble scores $\theta_i = \frac{\sum_j \alpha_j^2 (\beta_j + \gamma_j z_{ij})}{\sum_j \alpha_j^2}$ • $\theta_i$ - ensemble score for the $i$th observation • $\alpha_j$ - discrimination, $\gamma_j$ - scaling, and $\beta_j$ - difficulty parameters for the $j$th AD method • $z_{ij}$ - anomaly score of the $j$th AD method on the $i$th observation
  • 26. Why does it work? • $\theta_i = \frac{\sum_j \alpha_j^2 (\beta_j + \gamma_j z_{ij})}{\sum_j \alpha_j^2} = \sum_j (c_j + w_j z_{ij})$ • Ensemble scores are a weighted average of the original anomaly scores • The weights $w_j$ depend on the discrimination and scaling parameters of each anomaly detection method • AD methods with higher discrimination get higher weights • Ensemble accentuates better methods and downplays noisy methods • Each AD method has a weight
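The identity on this slide can be checked numerically. Matching terms in the formula gives $w_j = \alpha_j^2 \gamma_j / \sum_k \alpha_k^2$ and $c_j = \alpha_j^2 \beta_j / \sum_k \alpha_k^2$. The sketch below uses made-up parameter values for three hypothetical AD methods (they are not fitted values from the talk or the paper) and confirms that both forms of $\theta_i$ agree.

```python
# Hypothetical fitted parameters for three AD methods (illustrative values only).
alpha = [2.0, 1.0, 0.5]   # discrimination
beta = [0.1, -0.2, 0.0]   # difficulty
gamma = [1.0, 0.8, 1.2]   # scaling

# z[i][j]: anomaly score of method j on observation i (toy values).
z = [
    [0.1, 0.2, 0.9],
    [0.9, 0.8, 0.1],
]

denom = sum(a ** 2 for a in alpha)


def theta_direct(row):
    """Ensemble score from the ratio form of the formula."""
    num = sum(a ** 2 * (b + g * s) for a, b, g, s in zip(alpha, beta, gamma, row))
    return num / denom


# Per-method weight w_j and offset c_j obtained by matching terms.
w = [a ** 2 * g / denom for a, g in zip(alpha, gamma)]
c = [a ** 2 * b / denom for a, b in zip(alpha, beta)]


def theta_weighted(row):
    """Ensemble score from the weighted-sum form."""
    return sum(cj + wj * s for cj, wj, s in zip(c, w, row))


scores = [theta_direct(row) for row in z]
```

With these values the first method has the largest discrimination and therefore the largest weight, so the second observation, which that method scores highly, receives the higher ensemble score.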
  • 27. This work • R package – outlierensembles – on CRAN • Extends R package EstCRM for IRT • Includes other anomaly detection ensembles as well • More details in the paper https://arxiv.org/abs/2106.06243
  • 28. We looked at an AD ensemble. Next, we dive into an application. 28
  • 29. An application in computer network security Honeyboost: Boosting honeypot performance with data fusion and anomaly detection Sevvandi Kandanaarachchi, Hideya Ochiai (UTokyo), Asha Rao (RMIT) Expert Systems with Applications (2022) 29
  • 30. LAN Security Monitoring Project • Between 12 ASEAN and SAARC countries • Boost cyber-resilience among partners • Countries in low economic conditions • Cost effective methods • Focus on Local Area Networks (LAN) • Figure: Average Monthly Malware Encounter Rate, 2018 (Microsoft, Security Intelligence Report, 2019) • Nodes: about 10 in Japan, 3 in Malaysia, 1 in Laos, 6 in Thailand, 2 in Myanmar, 4 in Indonesia, 2 in Cambodia, 2 in India, 2 in the Philippines, 4 in Vietnam
  • 31. Inside a Local Area Network (LAN) • Devices communicating with each other: LAN-Security Monitoring Device (honeypot), smartphones, printer, smart appliances, data server • Any suspicious behaviour? • Detect malware in action
  • 32. The Data • Several protocol features • Features derived by looking at packet headers • Features specific to the protocol • Each protocol has a different number of features

    Protocol 1
    Timestamp, From_Node, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11
    1553585825, '172.16.1.107', 80, 2, 64, 0, 2, 0, 0, 0, 0, 1, 1
    1553585890, '172.16.1.107', 80, 2, 64, 0, 2, 0, 0, 0, 0, 1, 1
    1553660565, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 1, 1
    1553660570, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0
    1553667575, '172.16.1.107', 80, 3, 64, 0, 3, 0, 0, 0, 0, 2, 2
    1553667580, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0
    1553751195, '172.16.1.208', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0

    Protocol 2
    Timestamp, From_Node, G1, G2, G3
    1554351595, '172.16.1.86', 3702, 2, 652
    1554351595, '172.16.1.86', 137, 2, 78
    1554351595, '172.16.1.86', 1900, 4, 146
    1554351595, '172.16.1.86', 7, 1, 28

  • 33. Varying-dimensional time series • Sort by time, then by node • Different protocols have different features • Finding anomalies from varying-dimensional time series • 400 computers/nodes = 400 varying-dimensional time series • Which ones are anomalous?
  • 34. The methodology • Varying-dimensional time series for each node → multivariate time series → compute features → AD method lookout • Using a window model • We know the real anomalous nodes and the times (they access something they shouldn’t - the honeypot)
  • 35. Varying-dimensional time series for each node → multivariate time series (Node A)

    Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
    30         ARP       10         12          0        0        0        0
    55         TCP       0          0           -2.15    1.75     0        0
    85         UDP       0          0           0        0        3.56     0.45

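The table above suggests that each record keeps the columns of its own protocol and gets zeros everywhere else. A minimal Python sketch of that zero-filling step follows. The record layout and the function name `to_multivariate` are assumptions made for illustration, and the per-protocol values (counts, degrees, PC1/PC2) are taken as given rather than computed.

```python
# Fixed column order of the multivariate series, as in the Node A table.
COLUMNS = ["ARP count", "ARP degree", "TCP PC1", "TCP PC2", "UDP PC1", "UDP PC2"]

# Each record carries only the features of its own protocol (hypothetical layout).
records = [
    (55, "TCP", {"TCP PC1": -2.15, "TCP PC2": 1.75}),
    (30, "ARP", {"ARP count": 10, "ARP degree": 12}),
    (85, "UDP", {"UDP PC1": 3.56, "UDP PC2": 0.45}),
]


def to_multivariate(records):
    """Sort records by time and zero-fill the columns of the other protocols."""
    rows = []
    for timestamp, protocol, values in sorted(records, key=lambda r: r[0]):
        point = [float(values.get(col, 0.0)) for col in COLUMNS]
        rows.append((timestamp, protocol, point))
    return rows


rows = to_multivariate(records)
```

Every row now has the same length, so the series for a node can be treated as a path of points in a common space.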
  • 36. Multivariate time series → compute features • The MV time series for each node gets transformed to a point in $\mathbb{R}^{17}$, the feature space for all nodes (Node A shown)

    Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
    30         ARP       10         12          0        0        0        0
    55         TCP       0          0           -2.15    1.75     0        0
    85         UDP       0          0           0        0        3.56     0.45

  • 37. Features • The total length of line segments in $\mathbb{R}^6$ • The maximum time difference • Number of protocols used • Number of TCP calls/UDP calls • Total length of line segments in each protocol space • Line of best fit in each protocol space • Sum of errors squared for the line of best fit
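A few of these features can be sketched in plain Python. The interpretations here are assumptions: "total length of line segments" is read as the length of the path joining consecutive points of the series, and "maximum time difference" as the largest gap between consecutive timestamps. The function names are illustrative, and only a subset of the 17 features is shown.

```python
import math

# (timestamp, protocol, point in R^6) for one node, using the Node A values.
series = [
    (30, "ARP", [10.0, 12.0, 0.0, 0.0, 0.0, 0.0]),
    (55, "TCP", [0.0, 0.0, -2.15, 1.75, 0.0, 0.0]),
    (85, "UDP", [0.0, 0.0, 0.0, 0.0, 3.56, 0.45]),
]


def total_segment_length(points):
    """Sum of Euclidean distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def max_time_difference(timestamps):
    """Largest gap between consecutive timestamps (0 if fewer than two)."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0)


def node_features(series):
    """Summarise one node's multivariate time series as a small feature dict."""
    ordered = sorted(series, key=lambda r: r[0])
    timestamps = [t for t, _, _ in ordered]
    protocols = [p for _, p, _ in ordered]
    points = [x for _, _, x in ordered]
    return {
        "total_length": total_segment_length(points),
        "max_time_diff": max_time_difference(timestamps),
        "n_protocols": len(set(protocols)),
        "n_tcp": protocols.count("TCP"),
        "n_udp": protocols.count("UDP"),
    }


features = node_features(series)
```

Each node ends up as one fixed-length vector, which is what lets an anomaly detection method compare nodes whose raw series have different dimensions.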
  • 38. AD method - lookout • lookout - work with Rob Hyndman. Published in JCGS (2021) • Uses Extreme Value Theory (EVT) to find anomalies • Applicability: Computer network traffic has heavy tails – EVT can handle that • Feature space → AD method lookout
  • 39. Results • We identify real anomalies before they access the honeypot (they shouldn’t do that) • The nodes behave in an anomalous way before a “breach” is triggered • We can predict a breach using this method • Low false positives • Visualize how anomalies develop • Discover patterns of suspicious behaviour
  • 40. Thoughts ... • This was a classic data science problem • We were given the data and the problem context and asked to tackle it • How do you formulate the problem? • Many building blocks • Identifying anomalies and visualising them aids decision making • Bonus: open up a research avenue • Underlying general research problem, not application-specific 40
  • 41. Another way to think about this problem • Model the network dynamics • Find suspicious behaviour in a network • Network dynamics not commonly used in cyber security • Public datasets do not facilitate that • Growth potential in this area • Tom Bernardi’s MSc project 41
  • 43. Next challenges for the field • Networks • Anomalies/events in networks (computer networks) • Nuwan’s MSc Project on behavioural biometrics • Visualization of networks at different granularities • Dynamic networks – echo chambers – how they form • Event detection in spatiotemporal data • Applications in epidemiology • Can you identify a hotspot before it happens? • Ecology • Algorithm bias • Bias in data + bias in algorithm 43
  • 44. Recap: ensembles to networks • Broad applicability in detecting interesting patterns in data • Applications in cyber security, wearable sensors, satellite data, social media • Core research problem ties back to statistics/maths • Need robust, highly accurate methodologies that can capture these patterns • Exciting field. Thrilled to be part of it!
  • 48. Continuous IRT model • Samejima, 1969 – Continuous Response Model • Wang and Zeng, 1998 – Procedure to compute item parameters using expectation maximization for Samejima’s model • Shojima, 2005 – Non-iterative item parameter solution in each EM cycle • Zopluoglu, 2015 – EstCRM R package implements Shojima’s 2005 model • Update the loglikelihood – to include negative discrimination items 48
  • 50. Example with iterations • Data in $\mathbb{R}^6$ - first 2 dimensions shown, others normally distributed • Evaluation metric – area under the ROC curve • Panels: Iteration 5, Iteration 10
  • 51. Comparison on a data repository 51
  • 53. LAN Security Monitoring • ‘LAN-Security Monitoring Device’ to capture suspicious/malicious activities that happen inside a LAN (Local Area Network) • Honeypot - a trap for attackers • Other devices on the LAN: smartphones, printer, smart appliances, data server
  • 55. Findings • Suspicious nodes that do not access the honeypot • Figure: feature space for all nodes, with lookout highlighting two nodes that do not access the honeypot