In today's world, the majority of information is generated by self-sustaining systems such as bots, crawlers, servers, and other online services. This information flows along the axis of time and is generated by these actors under some complex logic: for example, a stream of buy/sell order requests from an order gateway in the financial world, a stream of web requests from a monitoring or crawling service, or a hacker's bot sitting on the internet and attacking various computers. We may not be able to know the motive or intention behind these data sources, but with unsupervised techniques we can try to infer patterns and correlate events based on their repeated occurrences over time. Associating a chain of events in time order helps in root-event analysis. In certain cases, time-ordered correlation and root-event identification are good enough to automatically identify the signatures of malicious actors and take corrective actions, such as stopping cyber attacks or malicious social campaigns.
Sessionisation is one such unsupervised technique: it tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is a much more complex activity, as even systematic events generated by machines on the internet behave erratically. The notion of a signal's period changes accordingly: we can no longer describe it with a single number; it has to be treated as a random variable with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk is to showcase applied data science techniques for discovering stochastic periods. There are many ways to obtain periods from data, so the journey begins with a walkthrough of existing techniques such as the FFT (Fast Fourier Transform) and Gaussian Mixture Models. After highlighting the shortcomings of these techniques, we will succinctly explain one of the most general non-parametric Bayesian approaches to the problem. Without going too deep into the complex math, we will then return to applied data science and discuss a much simpler technique that can solve the same problem when certain assumptions are satisfied.
We will also demonstrate time-based patterns we discovered while working on a security analytics use case that relies on Sessionisation, based on an openly available malware attack dataset.
Key concepts explained in the talk: Sessionisation, Bayesian techniques of machine learning, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
2. Thales Overview
From the Bottom of the Oceans… to the Depths of Space & Cyberspace
Key Digital Technologies
3. Thales: A Research and Development Powerhouse
6-time winner: 2012, 2013, 2015, 2016, 2017, 2018
Expertise in a uniquely broad range of technical domains, from science to systems, applied across businesses.
An extensive intellectual property portfolio of 20,500 patents.
Albert Fert: scientific director of the CNRS/Thales joint physics unit and winner of the 2007 Nobel Prize in Physics.
4. Agenda
• Motivation for studying events
• Concept & purpose of Sessionisation
• Traditional approaches
• Real world case studies
• Applied Data Science way of doing Sessionisation
5. Events
• Orders placed in a market
• Sequence of user tweets
• User’s clicks on a website
• Activity update by an IoT device
• Network events on a router
• Network alarms in a network
12. Sessions: Operations vs Data Science view
[Diagram: continuity in activity defines a session; the event stream splits into sessions wherever the gap between events ≫ the mean activity period]
13. Sessions: Operations vs Data Science view
A session is a chain of time-sequenced events: a time-based correlation.
[Diagram: same session split, with sessions separated wherever the gap ≫ the mean activity period]
20. Malicious Actors in the world of AI
• Orders placed in a market: Market manipulation
• Sequence of user tweets: Bot campaigns
• User’s clicks on a website: Fraudulent transactions
• Activity by an IoT device: Taking device control
• Network events on a router: Cyber attacks
22. Approaches for finding time based patterns
• Fourier transform
• Time period – stochastic periods
• GMM (Gaussian Mixture Models)
• Infinite GMM (Gaussian Mixture Models)
• Non-parametric Bayesian methods
• Applied data science techniques
[Chart: information vs. complexity of the approaches, with applied data science highlighted]
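Of the approaches above, the Fourier transform is the simplest to try first. A minimal sketch on synthetic data (the 30 s period, sampling rate, and noise level are made-up values, not from the talk's datasets):

```python
import numpy as np

# Recover the dominant period of a noisy periodic event-rate signal.
fs = 1.0                       # one sample per second
t = np.arange(0, 600, 1 / fs)  # 10 minutes of data
period = 30.0                  # true period in seconds
noise = 0.3 * np.random.default_rng(0).normal(size=t.size)
signal = np.sin(2 * np.pi * t / period) + noise

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
print(round(1 / dominant, 1))                  # → 30.0
```

This works well when the period is a fixed number; the rest of the talk is about what to do when it is not.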
30. Case studies via public datasets
• Sessionisation is an essential activity in detecting malicious bot activities like beaconing
• We will use the 6th dataset of the CTU-13 datasets for examples
• Provided by the Czech Technical University (CTU)
• Traces captured from a malware attack executed in the university network
• The 6th dataset simulates a bot named DonBot, which attacks SVC services on Windows
• Dataset: https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-47/bro/conn.log
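As a hypothetical sketch of getting from such a capture to time deltas: a Bro/Zeek conn.log can be read as tab-separated text. The rows and the column subset below are made up for illustration; the real file has many more fields.

```python
import io
import pandas as pd

# Made-up excerpt in the Bro/Zeek conn.log style (tab-separated,
# header lines prefixed with '#').
sample = """#fields\tts\tid.orig_h\tid.resp_h\tproto
1313370000.1\t147.32.84.165\t147.32.80.9\tudp
1313370005.2\t147.32.84.165\t147.32.80.9\tudp
1313370010.4\t147.32.84.165\t147.32.80.9\tudp
"""
df = pd.read_csv(io.StringIO(sample), sep="\t", comment="#",
                 names=["ts", "orig", "resp", "proto"])
deltas = df["ts"].diff().dropna()   # consecutive time deltas
print(deltas.tolist())
```

In practice one would group by source/destination pair before taking deltas.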
31. Case – 1: DonBot’s DNS queries to university DNS server
35. Stochastic periods: Introduction
• Analyze periodicity in time domain
• Compute consecutive time deltas
• Real world signals are noisy so time deltas will vary a lot
• If there is periodicity in the signal, time deltas will vary in a band
• The density plot of time deltas will show some high density regions
• We can learn a probability distribution for each high density region
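The steps above can be sketched as follows (the ~5 s period and noise level are illustrative, not from the CTU-13 capture):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Event timestamps with a noisy ~5 s period.
rng = np.random.default_rng(1)
timestamps = np.cumsum(rng.normal(loc=5.0, scale=0.4, size=500))

deltas = np.diff(timestamps)        # consecutive time deltas
kde = gaussian_kde(deltas)          # density estimate over the deltas
grid = np.linspace(deltas.min(), deltas.max(), 200)
mode = grid[np.argmax(kde(grid))]   # centre of the high-density region
print(round(mode, 1))               # close to the 5 s period
```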
58. Auto discovering multiple distributions
tation Maximization
w to estimate parameter ?✓
Expectation Maximization
If sources are known,easy:
How to estimate parameter ?✓GMM - Gaussian Mixture Models
59. Auto discovering multiple distributions
sticmodel of data Gaussian Mixture Model (GMM)
GMM - Gaussian Mixture Models
60. GMM – Gaussian Mixture Models
• Does soft clustering of data points instead of hard clustering
• In principle it is very similar to K-Means but works on probabilities
• K-Means: {P1 → C1, P2 → C2}; GMM: {P1 → [0.8, 0.1, 0.1], P2 → [0.05, 0.85, 0.1]}
• Problem with GMM & K-Means: we need to define “K”
• Techniques like the Elbow method, Silhouette, etc. are based on certain assumptions
• They cannot be applied in general for automated discovery of K
• Finding “K” automatically is a very hard problem to solve
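A minimal soft-clustering illustration with scikit-learn's GaussianMixture, on two synthetic 1-D clusters. Note that K = 2 is fixed by hand here, which is exactly the limitation the slide points out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic clusters.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200),
                       rng.normal(8, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
probs = gmm.predict_proba([[4.0]])  # a point between the two clusters
print(probs.round(2))               # soft membership, not a hard label
```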
61. Bayesian way of building models

Posterior = (Likelihood × Prior) / Evidence:
P(θ|X) = P(X|θ) P(θ) / P(X)

P(θ) is conjugate to P(X|θ) when the posterior stays in the same family as the prior, with updated parameters: A(ν) → A(ν′)

For example:
P(θ) = 𝒩(θ|0, 1)   # standard normal prior
P(X|θ) = 𝒩(x|θ, 1)  # likelihood with 1 std. dev.
P(θ|X) ∝ e^(−½(x−θ)²) · e^(−½θ²)
P(θ|X) ∝ e^(−(θ − x/2)²)
P(θ|X) = 𝒩(θ | x/2, 1/2)
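The conjugate update above can be checked numerically: this sketch integrates the unnormalized posterior on a grid and recovers the mean x/2 and variance 1/2.

```python
import numpy as np

# Prior N(0, 1), likelihood N(x | theta, 1), one observation x = 3.
x = 3.0
theta = np.linspace(-10, 10, 100001)
d = theta[1] - theta[0]
unnorm = np.exp(-0.5 * (x - theta) ** 2) * np.exp(-0.5 * theta ** 2)
post = unnorm / (unnorm.sum() * d)           # normalized posterior density

mean = (theta * post).sum() * d
var = ((theta - mean) ** 2 * post).sum() * d
print(round(mean, 3), round(var, 3))         # → 1.5 0.5
```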
66. Sessionisation: Data Science at scale
• In a real world scenario, be it
• Web users over internet, Network hosts in an enterprise network, etc.
• One would need to apply Sessionisation on millions of entities
• So manual inspection based methods cannot be used
• We need a fully automated system to discover multiple “stochastic periods”
• We need to find the clusters automatically
72. Intuition behind Infinite GMM
Properties of the Dirichlet distribution
The Dirichlet distribution is conjugate to the Multinomial distribution.
If π = (π₁, π₂, …, π_k) ~ Dirichlet(α₁, α₂, …, α_k), the Dirichlet satisfies an expansion (or combination) rule:
(π₁θ, π₁(1−θ), π₂, …, π_k) ~ Dirichlet(α₁b, α₁(1−b), α₂, …, α_k)
where 0 < b < 1 and θ ~ Beta(α₁b, α₁(1−b)).
This allows us to increase the dimensionality of the Dirichlet.
73. Intuition behind Infinite GMM
Applying the expansion rule repeatedly, starting from symmetric parameters, leads to the Dirichlet process as a limit:
π⁽²⁾ = (π₁⁽²⁾, π₂⁽²⁾) ~ Dir(α/2, α/2) → Dir(α/4, α/4, α/4, α/4) → … → Dir(α/K, …, α/K), with K → ∞
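The expansion rule can be checked by simulation. This sketch uses made-up parameters α = (2, 3, 5) and b = 0.4 and compares the empirical means of the expanded vector with those of the target Dirichlet:

```python
import numpy as np

# Monte-Carlo check of the Dirichlet expansion rule.
rng = np.random.default_rng(0)
alpha, b, n = np.array([2.0, 3.0, 5.0]), 0.4, 200_000

pi = rng.dirichlet(alpha, size=n)
theta = rng.beta(alpha[0] * b, alpha[0] * (1 - b), size=n)
expanded = np.column_stack([pi[:, 0] * theta,        # pi_1 * theta
                            pi[:, 0] * (1 - theta),  # pi_1 * (1 - theta)
                            pi[:, 1], pi[:, 2]])

# Target: Dirichlet(a1*b, a1*(1-b), a2, a3); its mean is target/sum.
target = np.array([alpha[0] * b, alpha[0] * (1 - b), alpha[1], alpha[2]])
print(expanded.mean(axis=0).round(2))  # ≈ target / target.sum()
```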
74. Dirichlet process
Constructive metaphors used for non-parametric Bayesian models:
• Chinese restaurant process: the nth customer sits at table k with probability proportional to m_k (the number of customers already seated there), and opens a new table with probability proportional to α.
• Indian buffet process: the nth customer helps himself to each dish k with probability m_k / n (the fraction of previous customers who chose dish k), then tries Poisson(α/n) new dishes.
[Figures: IBP dish assignments and the resulting binary matrix; the Chinese restaurant process in action]
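A minimal Chinese restaurant process sketch (α = 1.0 is an arbitrary choice for illustration):

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Simulate table occupancies under a Chinese restaurant process."""
    tables = []  # occupancy count per table
    for _ in range(n_customers):
        # Existing tables weighted by occupancy; a new table weighted by alpha.
        weights = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1
    return tables

rng = np.random.default_rng(0)
tables = crp(100, 1.0, rng)
print(tables)  # a random partition of 100 customers
```

The number of tables grows without a fixed K, which is the intuition behind the Infinite GMM.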
77. Probabilistic modeling
• Probabilistic models capture the uncertainty in real-world data better
• But they are very computationally intensive
• The sampling process takes time to stabilize before it generates meaningful results
• They certainly cannot work on large datasets
82. Obtaining stochastic periods recursively
• Get dense regions list: find dmin to cluster a region
• Recursively split regions: if a region is heavy tailed or multi-modal
• Stochastic periods: get probability distributions from the dense regions

Split criteria:
• Kurtosis of the normal distribution = 3; heavy tailed if excess kurtosis > 6
• Bimodality = (γ² + 1) / κ, where γ is skewness and κ is kurtosis
• Bimodality of the uniform distribution is 5/9; treat a region as not unimodal if bimodality > 0.8
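The split test above can be written as a small helper. The thresholds (excess kurtosis > 6, bimodality coefficient > 0.8) come from the slide; the test data are made up:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def needs_split(x):
    """True if a region looks heavy tailed or not unimodal."""
    g = skew(x)
    k = kurtosis(x, fisher=False)    # plain kurtosis (normal = 3)
    heavy_tailed = (k - 3) > 6       # excess kurtosis threshold
    bimodality = (g ** 2 + 1) / k    # = 5/9 for a uniform distribution
    return bool(heavy_tailed or bimodality > 0.8)

rng = np.random.default_rng(0)
unimodal = rng.normal(0, 1, 5000)
bimodal = np.concatenate([rng.normal(-5, 1, 2500), rng.normal(5, 1, 2500)])
print(needs_split(unimodal), needs_split(bimodal))  # → False True
```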
83. dmin via distance matrix
This topic was covered generically in detail during ODSC 2018 – “Topological space clustering”
84–85. dmin via distance matrix (continued)
If local dense regions exist alongside sparsity, we can obtain hierarchical clusters at each mode.
87. Method proposed: finding the optimal clustering epsilon
• The problem comes down to finding the most optimal curve for the Gaussian kernel
• One of the ways to solve it algorithmically:
Grid Search (band_width, grid_size) → rFFT → Silverman Transform → I-rFFT → Score (logLoss, stdDev) → Minima (band_width, grid_size)
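The rFFT-based pipeline above is specific to the talk. As a simplified stand-in, this sketch grid-searches the Gaussian-kernel bandwidth around Silverman's rule of thumb, scoring by held-out log-likelihood instead of the Silverman transform; data and grid are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Synthetic time deltas: a tight ~5 s mode plus a wider ~60 s mode.
rng = np.random.default_rng(0)
deltas = np.concatenate([rng.normal(5, 0.3, 400), rng.normal(60, 2.0, 100)])
x = deltas.reshape(-1, 1)

# Silverman's rule-of-thumb bandwidth as the centre of the search grid.
silverman = 1.06 * deltas.std() * deltas.size ** (-1 / 5)
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": silverman * np.logspace(-1, 1, 21)},
                    cv=5).fit(x)
print(round(grid.best_params_["bandwidth"], 2))
```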
88. Genuine systematic DNS queries to DNS server
[Figures: time delta analysis; time delta density plot]
92. Finite GMM: Bayesian setting
Algorithm: Collapsed Gibbs sampler for a finite Gaussian mixture model
Choose an initial z
For T iterations do                                # Gibbs sampling iterations
  For i = 1 to N do
    Remove xi’s statistics from component zi       # old cluster assignment for xi
    For k = 1 to K do                              # every possible component
      Calculate P(zi = k | z−i, α)
      Calculate p(xi | Xk,−i, β)
      Calculate P(zi = k | z−i, X, α, β) ∝ P(zi = k | z−i, α) · p(xi | Xk,−i, β)
    End for
    Sample knew from P(zi | z−i, X, α, β) after normalizing
    Add xi’s statistics to component zi = knew     # new assignment for xi
  End for
End for
Evaluation metric for Gibbs: ∏ₖ₌₁..K p(Xk | β) · p(z | α)
93. Infinite GMM: Bayesian setting
Choose an initial z
For T iterations do                                # Gibbs sampling iterations
  For i = 1 to N do
    Remove xi’s statistics from component zi       # old cluster assignment for xi
    For k = 1 to K do                              # every existing component
      Calculate P(zi = k | z−i, α)
      Calculate p(xi | Xk,−i, β)
      Calculate P(zi = k | z−i, X, α, β) ∝ P(zi = k | z−i, α) · p(xi | Xk,−i, β)
    End for
    Calculate P(zi = k* | z−i, α)                  # consider a new component
    Calculate p(xi | β)
    Calculate P(zi = k* | z−i, X, α, β) ∝ P(zi = k* | z−i, α) · p(xi | β)
    Sample knew from P(zi | z−i, X, α, β) after normalizing
    If any component is empty, remove it and decrease K
    Add xi’s statistics to component zi = knew     # new assignment for xi
  End for
End for
80,000 employees in 68 countries, a global company
Heavy investments in innovation every year to develop state-of-the-art technologies: €1bn invested in self-funded R&D