1. Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
David Muelas, Miguel Gordo, José Luis García-Dorado,
Jorge E. López de Vergara
Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es,
miguel.gordo@estudiante.uam.es
Universidad Autónoma de Madrid
CNSM 2015 – November 2015
2. Network Health Check
Network managers must monitor the network's vital signs to ensure that it is
healthy:
[Figure: (a) ECG. (b) Dictyogram (normalized version): categories Cat1 to Cat10 plotted against time of day (00:00:00 to 23:20:00), vertical axis from 0 to 1.]
But... what exactly is a Dictyogram?
3. Dictyogram (from δίκτυο, "network" in Greek): a method to
graphically trace the behavior of network flows over time. Its
graphical output resembles an electrocardiogram of the network,
showing its vital signs.
4. Outline
1 Introduction
    Context
    Our Goals
2 Method definition
    Probability integral transform
    Modeling CDFs
3 Experimental results
    Model evaluation
    Dictyogram visualization
4 Conclusions
D. Muelas, M. Gordo, J.L. García-Dorado, J.E. López de Vergara Dictyogram 4
5. Context
Network flow-based monitoring has proven useful for detecting
network intrusions, malfunctions, and other types of
anomalies.
Unfortunately, network managers have to deal with very large
volumes of measurement data, and interpreting them has become a
challenge.
Data summaries: it is difficult to strike a good trade-off between
detail and simplification, since insufficient data can lead to
limited or even erroneous conclusions.
Measurements alone are not enough from the point of
view of network management: applying suitable
techniques improves the quality and depth of the knowledge
that can be extracted from them.
6. Our Goals
We aim to ease network managers' work with a novel approach to
studying the behavior of network flow characteristics. Our main
goal is to define comprehensive summaries of network flow data:
Our approach is based on the study of the ECDFs of different flow
characteristics (e.g., flow size or duration
distributions).
Using those ECDFs, we define flow categories through the
probability integral transform (e.g., using decile-delimited
intervals).
As we will see, this approach improves the detection of network
anomalies and the visualization of the network state.
7. Method description
Probability integral transform:
Let X be a continuous random variable with cumulative
distribution function $F_X$. Then $F_X(X)$ follows a uniform
distribution on [0, 1].
[Figure: panels (a) and (b), relating each probability level $P_i$ (vertical axis, 0 to 1) to the boundary $C_i = F_X^{-1}(P_i)$.]
Then, we define flow categories from a set of probability
levels, using the CDF of a given flow characteristic.
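As an illustration of this step, the following minimal sketch derives category boundaries $C_i = F_X^{-1}(P_i)$ from an empirical sample and assigns values to categories. The function names (`quantile_boundaries`, `categorize`) and the choice of equally spaced probability levels are assumptions made for this example; they do not come from the slides.

```python
import bisect


def quantile_boundaries(sample, n_categories):
    """Empirical boundaries C_i = F^{-1}(i / n_categories), i = 1..n_categories - 1."""
    ordered = sorted(sample)
    n = len(ordered)
    bounds = []
    for i in range(1, n_categories):
        # Index of the smallest order statistic whose ECDF value reaches
        # i / n_categories (integer ceiling avoids floating-point rounding).
        k = -(-i * n // n_categories) - 1
        bounds.append(ordered[k])
    return bounds


def categorize(value, bounds):
    """Category index in 0..len(bounds); category 0 holds values <= bounds[0]."""
    return bisect.bisect_left(bounds, value)


# Hypothetical flow sizes (bytes), used only to exercise the functions.
flow_sizes = list(range(1, 101))
deciles = quantile_boundaries(flow_sizes, 10)
counts = [0] * 10
for size in flow_sizes:
    counts[categorize(size, deciles)] += 1
```

With decile boundaries taken from the same sample, each of the ten categories receives the same number of flows, which is the uniformity that the probability integral transform predicts.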
8. Keep an eye on the hypotheses!
[Figure: two examples, (c) Gaussian and (d) Poisson, each with panels (a) and (b) and vertical axes from 0 to 1.]
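The continuity hypothesis matters: for a discrete variable, $F_X(X)$ takes only a few distinct values and cannot be uniform. The sketch below, which presumably mirrors the contrast shown in the Gaussian and Poisson panels, applies each distribution's own CDF to simulated samples and measures the share of values falling in each tenth of [0, 1]. The parameters (mean 30, standard deviation 2, Poisson rate 2), the seed, and all function names are arbitrary choices for this example.

```python
import math
import random


def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def poisson_cdf(k, lam):
    # Sum of the probability mass function from 0 to k.
    term = total = math.exp(-lam)
    for j in range(1, k + 1):
        term *= lam / j
        total += term
    return total


def poisson_sample(rng, lam):
    # Knuth's multiplication method; adequate for small rates.
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def bin_shares(values, n_bins=10):
    """Share of values falling in each of n_bins equal slices of [0, 1]."""
    counts = [0] * n_bins
    for u in values:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


rng = random.Random(0)
gauss_u = [gaussian_cdf(rng.gauss(30.0, 2.0), 30.0, 2.0) for _ in range(20000)]
pois_u = [poisson_cdf(poisson_sample(rng, 2.0), 2.0) for _ in range(20000)]
gauss_shares = bin_shares(gauss_u)  # close to 0.1 in every bin
pois_shares = bin_shares(pois_u)    # several bins stay empty
```

In the continuous case every bin holds roughly a tenth of the samples, while in the Poisson case the transformed values pile up on a handful of points, so equally spaced probability levels no longer yield equally populated categories.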
9. How can we model a CDF?
Glivenko-Cantelli theorem: the ECDF converges to the CDF
as the number of observations increases.
Nonetheless, computational cost increases when we
accumulate all the values of the characteristic under analysis.
Alternative approach: Functional Data Analysis:
Mean Function: $F_X^{\mathrm{mean}} = \frac{1}{n} \sum_{i=1}^{n} F_{X_i}$
Problem: not robust
Functional Depth:
Maximum depth observation.
Median Function (it is the function that maximizes the
functional depth we use).
Problem: more computationally expensive
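The mean function can be sketched as follows, assuming each observation period (for example, one day) yields its own ECDF and that all ECDFs are evaluated on a shared grid of values. The grid, the sample data, and the function names are hypothetical and serve only to illustrate the pointwise average.

```python
import bisect


def ecdf_on_grid(sample, grid):
    """Evaluate the ECDF of `sample` at every point of `grid`."""
    ordered = sorted(sample)
    n = len(ordered)
    return [bisect.bisect_right(ordered, x) / n for x in grid]


def mean_function(curves):
    """Pointwise average of several curves evaluated on the same grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


grid = [0, 2, 4]
day_1 = ecdf_on_grid([1, 2, 3, 4], grid)  # [0.0, 0.5, 1.0]
day_2 = ecdf_on_grid([2, 2, 2, 2], grid)  # [0.0, 1.0, 1.0]
model = mean_function([day_1, day_2])     # [0.0, 0.75, 1.0]
```

Because every curve contributes equally at every grid point, a single atypical day shifts the whole model, which is the lack of robustness noted above.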
10. Dataset for the evaluation
To assess the advantages of our method, we used a real
dataset:
Flow records, Spanish Academic Network: more than one
million users, more than 7 years of data.
Exporters: 5 Netflow exporters, different geographical
locations (all of them in Spain).
Packet level sampling: rate of one out of 100 packets.
Period selected for the evaluation of the CDF estimation
methods: 30 days.
11. Analyzing ECDFs to get a model of the typical behavior
[Figure: CCDF plot. Horizontal axis: flow size (bytes), logarithmic, from 10^1 to 10^8. Vertical axis: P(X>x), from 0 to 1. Legend: Mean, Deepest, Median.]
Data points marked in the plot, as (flow size in bytes, P(X>x)): (40, 0.9), (44, 0.8), (53, 0.7), (80, 0.6), (149, 0.5), (501, 0.4), (1452, 0.3), (1500, 0.2), (3000, 0.1).
Figure: Comparison between observed CCDFs (orange line, no marker)
for Exporter A, and models obtained using the mean (blue line, circles),
deepest (black line, diamonds) and median (red line, triangles) functions.
12. Empirical comparison (I)
[Figure: five panels, one per exporter (A to E), each with a horizontal axis from 0 to 30 and curves for the Mean, Deepest and Median models. Vertical scales: A ×10^5, B ×10^6, C ×10^7, D ×10^6, E ×10^6.]
Figure: Evolution of Pearson's test statistic for all exporters (lower is
better).
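The slides do not spell out the formula, so the sketch below assumes the classic form of Pearson's statistic, $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ is the observed number of flows in category $i$ and $E_i = N p_i$ is the count expected under the model. With decile-delimited categories, each $p_i$ would be 0.1. The function name and the sample counts are hypothetical.

```python
def pearson_statistic(observed, probabilities):
    """Pearson's chi-squared statistic of observed counts against expected shares."""
    total = sum(observed)
    statistic = 0.0
    for count, p in zip(observed, probabilities):
        expected = total * p
        statistic += (count - expected) ** 2 / expected
    return statistic


uniform_deciles = [0.1] * 10

# Flows spread evenly over the ten categories: the model fits well.
good_fit = pearson_statistic([10] * 10, uniform_deciles)

# All flows concentrated in the first of two equally likely categories.
bad_fit = pearson_statistic([20, 0], [0.5, 0.5])
```

A statistic near zero means the observed category counts match the shares predicted by the model, whereas large values indicate that the model's boundaries no longer split the traffic evenly.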
13. Empirical comparison (II)
Table: Summary of the evaluation of the different methods to estimate
the CDF.
Exporter  Method           # Best
A         Mean function     0
          Deepest obs.      3
          Median function  25
B         Mean function     0
          Deepest obs.      6
          Median function  22
C         Mean function    20
          Deepest obs.      8
          Median function   0
D         Mean function     0
          Deepest obs.     23
          Median function   5
E         Mean function     0
          Deepest obs.     28
          Median function   0
14. Final visualization of Dictyogram
[Figure: three stacked panels, (a) Mean, (b) Deepest Observation, (c) Median. Horizontal axis: time of day (03:00:00 to 21:00:00). Vertical axis: concurrent flows for each category (scale ×10^4). Each panel carries two markers, labeled 1 and 2.]
Figure: Dictyogram representation of $f_i(t)$, with size
intervals delimited by the deciles given by (a) the mean function, (b) the
deepest observed ECDF, and (c) the median function.
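The series $f_i(t)$ plotted in a Dictyogram can be approximated as in the sketch below. It assumes flow records reduced to hypothetical `(start, end, size)` tuples with times in seconds, counts a flow in every time bin it overlaps (one possible reading of "concurrent flows"), and takes the size boundaries as given. None of these names or conventions come from the slides.

```python
import bisect


def dictyogram_counts(flows, bounds, bin_width, n_bins):
    """counts[b][c]: number of flows of category c active during time bin b."""
    counts = [[0] * (len(bounds) + 1) for _ in range(n_bins)]
    for start, end, size in flows:
        category = bisect.bisect_left(bounds, size)
        first = max(0, int(start // bin_width))
        last = min(n_bins - 1, int(end // bin_width))
        for b in range(first, last + 1):
            counts[b][category] += 1
    return counts


def normalize(counts):
    """Convert per-bin counts into per-bin shares (normalized version)."""
    shares = []
    for row in counts:
        total = sum(row)
        shares.append([c / total if total else 0.0 for c in row])
    return shares


# Three hypothetical flows, two size boundaries (three categories), 60 s bins.
flows = [(0, 50, 40), (30, 130, 2000), (70, 80, 40)]
counts = dictyogram_counts(flows, bounds=[100, 1000], bin_width=60, n_bins=3)
shares = normalize(counts)
```

Stacking the columns of `counts` over time gives the absolute view, and stacking `shares` gives the normalized view in which a well-calibrated decile model would show ten bands of similar height.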
15. Final visualization of Dictyogram
[Figure: horizontal axis from 00:00:00 to 23:20:00; vertical axis from 0 to 4 ×10^4; markers labeled 1 and 2.]
Figure: Zoom on the median-based panel.
16. Key remarks
Our method:
Is manager-friendly: it provides statistical summaries based
on selected probability levels, which eases the study of the
flows traversing the network.
Links statistical properties to time evolution: it eases the
detection of changes in the statistical properties of the
characteristics under analysis.
Improves network flow data visualization: it lets the manager
control the resolution at which the distribution of
network flow characteristics is displayed.
17. Future work
We plan to:
Study how to summarize several different network behaviors in
a multivariate uniform distribution, and use other well-known
distributions (and not only uniform) for signatures.
Study the distribution of Pearson's test statistic to detect
anomalous events.
Test the stability of the CDF estimation (to define
criteria for recalibrating the model).
Explore other representations with higher dimensionality.
19. Annex: Functional depth
We use the definition given by:
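$$MS_{n,H}(x) = \min\{SL_n(x), IL_n(x)\} \qquad (1)$$
where
$$SL_n(x) = \frac{1}{n\lambda(I)} \sum_{i=1}^{n} \lambda\{t \in I : x(t) \le x_i(t)\}$$
$$IL_n(x) = \frac{1}{n\lambda(I)} \sum_{i=1}^{n} \lambda\{t \in I : x(t) \ge x_i(t)\} \qquad (2)$$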
With it, we consider:
Maximum depth observation.
Median Function (it is the function that maximizes the
functional depth we use).
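A discretized reading of equations (1) and (2) is sketched below for the maximum depth observation. It assumes that all curves are sampled on the same uniform grid over $I$, so that the measure $\lambda$ of a set reduces to the fraction of grid points it contains. The function names and the three sample curves are hypothetical.

```python
def modified_depth(x, curves):
    """MS_{n,H}(x) = min(SL_n(x), IL_n(x)) on a shared uniform grid."""
    n, m = len(curves), len(x)
    # SL_n: average fraction of grid points where x lies at or below x_i.
    sl = sum(sum(1 for a, b in zip(x, xi) if a <= b) for xi in curves) / (n * m)
    # IL_n: average fraction of grid points where x lies at or above x_i.
    il = sum(sum(1 for a, b in zip(x, xi) if a >= b) for xi in curves) / (n * m)
    return min(sl, il)


def deepest_observation(curves):
    """Observed curve with maximum depth within the sample."""
    return max(curves, key=lambda c: modified_depth(c, curves))


# Three hypothetical ECDFs evaluated on a three-point grid.
low = [0.0, 0.2, 1.0]
middle = [0.0, 0.5, 1.0]
high = [0.0, 0.9, 1.0]
sample = [low, middle, high]
deepest = deepest_observation(sample)
```

The curve lying between the other two obtains the largest depth (8/9 here, against 7/9 for the outer curves), so it is selected as the maximum depth observation.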