Multimodal Biometrics Recognition by Dimensionality Diminution Method (IJERA Editor)
A multimodal biometric system combines two or more biometric modalities, e.g., face, ear, fingerprint, signature, or palmprint, to improve the recognition accuracy of conventional unimodal methods. We propose a new dimensionality reduction method called Dimension Diminish Projection (DDP) in this paper. DDP not only preserves local information by capturing the intra-modal geometry, but also effectively extracts the between-class structures relevant for classification. Experimental results show that our proposed method outperforms other algorithms, including PCA, LDA, and MFA.
Healthcare deserts: How accessible is US healthcare? (Data Con LA)
Data Con LA 2020
Description
In 2018, healthcare spending in the US accounted for 17% of the nation’s GDP. With such significant spending, how can we better understand what that means for healthcare and treatment accessibility? When policy changes occur, how can we gauge the impact on rural areas, which are disproportionately affected by inadequate access to healthcare (or “healthcare deserts”)? Using publicly available data and records, it is possible to locate all major hospitals in the U.S. and, for every residential ZIP code, model the population affected by healthcare deserts at various travel mileage thresholds. This talk will focus on:
· The several public datasets that are available to address this question
· The logic and algorithm(s) used to compute this efficiently in Python
· Visualizing the problem and telling the story in Tableau
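The core distance-threshold computation described above can be sketched roughly as follows; the ZIP centroids, hospital coordinates, and threshold are hypothetical stand-ins for the talk's public datasets:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in miles.
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def population_in_desert(zip_centroids, hospitals, threshold_miles):
    # zip_centroids: list of (lat, lon, population); hospitals: list of (lat, lon).
    # A ZIP is in a "desert" if no hospital lies within threshold_miles of its centroid.
    total = 0
    for lat, lon, pop in zip_centroids:
        if all(haversine_miles(lat, lon, h_lat, h_lon) > threshold_miles
               for h_lat, h_lon in hospitals):
            total += pop
    return total

# Hypothetical example: one ZIP centroid, one hospital about 69 miles east.
desert_pop = population_in_desert([(0.0, 0.0, 100)], [(0.0, 1.0)], threshold_miles=50)
```

In practice the pairwise loop would be replaced by a spatial index, but the thresholded nearest-hospital test is the essential step.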
Speaker
Andrew Kaszpurenko, Manager of Advanced Analytics, Edwards Lifesciences THV Division
High performance intrusion detection using modified k-means & naïve Bayes (eSAT Journals)
Abstract
Internet technology is growing at an exponential rate, making the data security of computer systems more complex and critical. Multiple methodologies have been implemented for this purpose in recent years, as detailed in [1], [3]. The availability of larger bandwidth has connected many large server networks worldwide, increasing the need to secure data; an intrusion detection system (IDS) is one of the most efficient techniques for maintaining the security of a computer system. The proposed system is designed to help identify malicious behavior and improper use of computer systems. In this report we propose a hybrid technique for intrusion detection using data mining algorithms. Our main objective is a complete analysis of an intrusion detection dataset to test the implemented system. We propose a new methodology in which a modified k-means algorithm is used for clustering and Naïve Bayes for classification. These two data mining techniques are combined for intrusion detection in large, horizontally distributed databases.
Keywords: Intrusion Detection, Modified K-Means, Naïve Bayes
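A rough sketch of the two-stage idea, with plain Lloyd's k-means standing in for the paper's modified variant and a minimal Gaussian Naïve Bayes; the synthetic "connection records" are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "connection records": two separable feature clouds, 0 = normal, 1 = attack.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

def kmeans(X, k, iters=20):
    # Plain Lloyd's k-means (stands in for the paper's "modified" variant).
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

def gaussian_nb_fit(X, y):
    # Per-class feature means, variances (smoothed), and priors.
    return {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-9, (y == c).mean())
            for c in np.unique(y)}

def gaussian_nb_predict(stats, X):
    classes = list(stats)
    scores = [(-0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)).sum(1) + np.log(p)
              for mu, var, p in stats.values()]
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]

# Stage 1: cluster the records; Stage 2: append the cluster id as an extra
# feature and classify with Naïve Bayes.
cluster_ids = kmeans(X, 2)
X_aug = np.hstack([X, cluster_ids.reshape(-1, 1).astype(float)])
stats = gaussian_nb_fit(X_aug, y)
accuracy = (gaussian_nb_predict(stats, X_aug) == y).mean()
```

Here the cluster assignment is simply fed to the classifier as an extra feature; the paper's actual coupling of the two stages may differ.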
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES (cscpconf)
This paper presents a new fault detection and isolation (FDI) technique applied to industrial systems. The technique is based on neural-network fault-free and faulty behaviour models (NNFMs). NNFMs are used for residual generation, while a decision tree architecture is used for residual evaluation. The decision tree is built from data collected at the NNFMs' outputs and is used to isolate detectable faults according to a computed threshold. Each branch of the tree corresponds to a specific residual. With the decision tree, it becomes possible to take the appropriate decision regarding the actual process behaviour by evaluating only a few residuals. Compared with the usual systematic evaluation of all residuals, the proposed technique requires less computational effort and can be used for online diagnosis. An application example is presented to illustrate and confirm the effectiveness and accuracy of the proposed approach.
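The residual-evaluation idea can be illustrated with a tiny hand-built tree; the fault signatures and the threshold below are invented for the example:

```python
# Each fault leaves a characteristic pattern across two residuals r1, r2.
# A small decision tree evaluates only the residuals it needs along its path,
# instead of systematically checking every residual.
THRESHOLD = 0.5  # in the paper this is computed from fault-free data; fixed here

def diagnose(r1, r2):
    # Root: does residual r1 exceed its threshold?
    if abs(r1) <= THRESHOLD:
        if abs(r2) <= THRESHOLD:
            return "fault-free"
        return "fault B"   # only r2 deviates
    # r1 deviates; check r2 to separate fault A from fault C.
    if abs(r2) <= THRESHOLD:
        return "fault A"   # only r1 deviates
    return "fault C"       # both deviate
```

For the fault-free case the tree touches both residuals here, but in larger trees most paths terminate after a small subset, which is the claimed computational saving.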
On Tracking Behavior of Streaming Data: An Unsupervised Approach (Waqas Tariq)
In recent years, data streams have been the focus of many researchers across different domains. All of them face the same difficulty when discovering unknown patterns within data streams: concept change. The notion of concept change refers to points where the underlying distribution of the data changes over time. Different methods have been proposed to detect changes in a data stream, but most rely on the unrealistic assumption that data labels are available to the learning algorithm. In real-world problems, however, labels for streaming data are rarely available, which is the main reason the data stream community has recently focused on unsupervised approaches. This study is based on the observation that unsupervised approaches to learning from data streams are not yet mature; they provide only mediocre performance, especially when applied to multi-dimensional data streams. In this paper, we propose a method for tracking changes in the behavior of instances using the cumulative density function, abbreviated TrackChCDF. Our method can accurately detect change points along an unlabeled data stream and can also determine the trend of the data, called closing or opening. The advantages of our approach are threefold. First, it detects change points accurately. Second, it works well on multi-dimensional data streams. Last but not least, it can determine the type of change, namely the closing or opening of instances over time, which has broad applications in fields such as economics, the stock market, and medical diagnosis. We compare our algorithm to the state-of-the-art method for concept change detection in data streams, and the results are very promising.
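A minimal sketch of CDF-based change detection on an unlabeled stream; a standard two-sample Kolmogorov–Smirnov statistic stands in for the paper's TrackChCDF machinery, and the window size and threshold are illustrative:

```python
import numpy as np

def ks_statistic(a, b):
    # Maximum gap between the two empirical CDFs (two-sample KS statistic).
    data = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), data, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), data, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def detect_changes(stream, window=50, threshold=0.5):
    # Compare each window with the preceding one; flag large CDF shifts.
    changes = []
    for i in range(window, len(stream) - window, window):
        ref = stream[i - window:i]
        cur = stream[i:i + window]
        if ks_statistic(ref, cur) > threshold:
            changes.append(i)
    return changes

# Unlabeled toy stream whose distribution shifts at index 200.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
change_points = detect_changes(stream)
```

No labels are used anywhere: the detector reacts purely to a shift between the empirical CDFs of consecutive windows.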
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION... (cscpconf)
Constructing a classification model for a particular task is important in machine learning. A classification process involves assigning objects to predefined groups or classes based on a number of observed attributes of those objects. The artificial neural network is one classification algorithm that can be used in many application areas. This paper investigates the potential of applying the feed-forward neural network architecture to the classification of medical datasets. The migration-based differential evolution (MBDE) algorithm is chosen and applied to the feed-forward neural network to enhance the learning process, and the network's learning is validated in terms of convergence rate and classification accuracy. In this paper, the MBDE algorithm with various migration policies is proposed for classification problems in medical diagnosis.
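A bare-bones sketch of training a tiny feed-forward network with classic differential evolution (DE/rand/1/bin); the migration between subpopulations that characterizes MBDE is omitted, and the data and network size are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class "medical" data: one informative feature.
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

def net_predict(w, X):
    # 1-2-1 feed-forward network; w packs all 7 weights and biases.
    h = np.tanh(X @ w[:2].reshape(1, 2) + w[2:4])
    return 1 / (1 + np.exp(-(h @ w[4:6] + w[6])))

def loss(w):
    # Cross-entropy of the network's outputs against the labels.
    p = np.clip(net_predict(w, X), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Classic DE/rand/1/bin evolving the weight vectors directly (no gradients).
NP, D, F, CR = 20, 7, 0.7, 0.9
pop = rng.normal(0, 1, (NP, D))
fitness = np.array([loss(w) for w in pop])
for _ in range(150):
    for i in range(NP):
        a, b, c = pop[rng.choice([j for j in range(NP) if j != i], 3, replace=False)]
        mutant = a + F * (b - c)                     # differential mutation
        trial = np.where(rng.random(D) < CR, mutant, pop[i])  # binomial crossover
        f_trial = loss(trial)
        if f_trial <= fitness[i]:                    # greedy selection
            pop[i], fitness[i] = trial, f_trial

best = pop[np.argmin(fitness)]
accuracy = ((net_predict(best, X) > 0.5) == y).mean()
```

MBDE would run several such populations in parallel and periodically migrate the best individuals between them according to a migration policy.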
Credit Default Swap (CDS) Rate Construction by Machine Learning Techniques (Zhongmin Luo)
1. Financial institutions need to construct proxy CDS rates for counterparties lacking liquid CDS quotes; these are required for CVA pricing, CVA risk charge calculation, etc.
2. Existing CDS proxy methods do not meet regulatory requirements and are vulnerable to arbitrage.
3. After investigating the eight most popular machine learning algorithms, we show that machine learning techniques can be used to construct reliable CDS proxies that meet regulatory requirements while remaining free from the above problem.
4. Feature variable selection can be critical to the performance of CDS-proxy construction methods.
5. The effects of feature variable correlations on classification performance have to be investigated in the case of financial data.
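A toy illustration of classification-based proxy construction: a plain k-nearest-neighbours classifier (one of many possible choices; the paper compares eight algorithms) assigns an illiquid counterparty to a spread bucket of similar liquid names. All features, buckets, and spreads here are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical features for counterparties WITH liquid CDS quotes:
# (encoded sector, encoded region, rating score) -> observed spread bucket.
X_liquid = np.vstack([rng.normal(m, 0.3, (40, 3)) for m in (0.0, 1.0, 2.0)])
bucket = np.repeat([0, 1, 2], 40)                  # 0 = tight, 1 = mid, 2 = wide
bucket_spread_bp = np.array([60.0, 150.0, 400.0])  # illustrative bucket averages

def proxy_spread(x, k=5):
    # k-NN over liquid names: the proxy spread is the average spread of the
    # majority bucket among the k most similar liquid counterparties.
    d = np.linalg.norm(X_liquid - x, axis=1)
    votes = bucket[np.argsort(d)[:k]]
    return bucket_spread_bp[np.bincount(votes).argmax()]

# An illiquid counterparty whose features resemble the "wide spread" group:
spread = proxy_spread(np.array([2.1, 1.9, 2.0]))
```

The regulatory and arbitrage-freeness considerations raised in the bullet points are, of course, properties of the full method, not of this sketch.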
New Ethernet standards, such as 40 GbE or 100 GbE, are already being deployed commercially along with the corresponding Network Interface Cards (NICs) for servers. However, network measurement solutions are lagging behind: while several tools are available for monitoring 10 or 20 Gbps networks, higher speeds pose a harder challenge that requires new ideas, different from those applied previously, and so fewer applications are available. In this paper, we present a system capable of capturing, timestamping, and storing 40 Gbps network traffic using a tailored network driver together with Non-Volatile Memory express (NVMe) technology and the Storage Performance Development Kit (SPDK) framework. We also present core ideas that can be extended to capture at higher rates: a multicore architecture that synchronizes with minimal overhead and reduces reordering of received frames, methods to filter traffic and discard unwanted frames without being computationally expensive, and an intermediate buffer that allows simultaneous access from several applications to the same data as well as efficient disk writes. Finally, we show a testbed for reliable benchmarking of our solution using custom DPDK traffic generators and replayers, which have been made freely available to the network measurement community.
Today, the proliferation of mobile devices and of Internet access over wireless technologies in home environments forces a change in the methodologies used for network measurements. For measurements to faithfully represent the conditions offered to users, the performance of the measurement equipment and the number of devices employed must be adapted to the real conditions of a deployment. To make measurements under these conditions easier and cheaper, this work presents an evaluation of the capabilities of several low-cost, general-purpose platforms. Our results show that, although limitations appear related to how they are connected to the network and to the protocols employed, they are suitable for measuring a wide variety of situations.
The Internet is an ever-growing network. Network equipment has to be improved to cope with this growth, including the devices used to classify network traffic. Internet service providers and network operators need to apply different QoS policies to specific protocols, so such classification systems are critical. However, classification by port does not provide good results, and it is necessary to apply other, more complex techniques. These classification techniques have to be fast enough to work at line rate. This paper presents a system that unifies the entire flow classification process at high speed. It captures the traffic, builds flows from the received packets, and finally classifies them on a GPU. The whole process runs at 10 Gbps on commodity hardware. Our results show that the achieved performance is strongly influenced by the number of protocols to detect and is limited by the number of network flows. In any case, our system reaches up to 24.4 Gbps using commodity hardware.
Current network services such as Voice over IP or IP television pose new challenges to network providers. Network operators need to know whether their services are being properly provided. However, the quality-of-service parameters commonly used in data networks (e.g., throughput, packet delay, packet loss) do not give a clear view of what users are experiencing. It is therefore necessary to translate such measured quality parameters into a quality-of-experience value. Several models are being developed to address this problem. For instance, some approaches have used the packet loss rate to evaluate the experienced quality of an IP television channel. Unfortunately, packet loss explains only a fraction of the quality behavior, so we go one step further and take into account the different MPEG frame types that are transmitted. In this paper, we define a model that predicts the experienced quality as a function of the losses of the different MPEG frame types, providing a mean opinion score for the delivered service. The final results show that our model predicts the quality of experience of such video services better than using the packet loss rate alone.
This talk presents a research topic to students of the Máster Universitario de Investigación en TIC at the Universidad de Valladolid.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies such as blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools, letting you effortlessly explore, discover, and access the data you need so you can focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
These privacy-preserving datasets can be used for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they are working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at its core: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, i.e., those with the same in-links, avoids duplicate computations and can thus also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos with a distributed, domain-oriented data ownership model that assigns clear owners and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
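As an illustration of the automated-validation point above, a minimal rule-based validator might look like the following sketch. The record fields and rules are hypothetical examples, not a real schema.

```python
# Hypothetical sketch of automated data validation at the point of ingestion.
# Each rule is a (field, predicate, message) triple; the fields are made up.

def validate_record(record, rules):
    """Return a list of human-readable violations for one record."""
    errors = []
    for field, check, message in rules:
        if field not in record:
            errors.append(f"{field}: missing")
        elif not check(record[field]):
            errors.append(f"{field}: {message}")
    return errors

RULES = [
    ("user_id", lambda v: isinstance(v, int) and v > 0, "must be a positive integer"),
    ("email",   lambda v: isinstance(v, str) and "@" in v, "must look like an email address"),
    ("amount",  lambda v: isinstance(v, (int, float)) and v >= 0, "must be non-negative"),
]

good = {"user_id": 7, "email": "a@b.com", "amount": 12.5}
bad  = {"user_id": -1, "email": "nope", "amount": 3}
```

Running such checks at the source, before data enters shared pipelines, is what keeps errors from propagating downstream.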
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
1. Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
David Muelas, Miguel Gordo, José Luis García-Dorado, Jorge E. López de Vergara
Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es, miguel.gordo@estudiante.uam.es
Universidad Autónoma de Madrid
CNSM 2015 – November 2015
2. Network Health Check
Network managers must monitor network vital signs to assure it is healthy:

[Figure: (a) an ECG trace; (b) a Dictyogram (normalized version) over one day, with flow categories Cat1–Cat10.]
But... what exactly is a Dictyogram?
3. Dictyogram (from δίκτυο, network in Greek): a method to graphically trace network flow behavior versus time. Its graphical results can look like a network electrogram, showing the network's vital signs.
4. Outline
1 Introduction
Context
Our Goals
2 Method definition
Probability integral transform
Modeling CDFs
3 Experimental results
Model evaluation
Dictyogram visualization
4 Conclusions
D. Muelas, M. Gordo, J.L. García-Dorado, J.E. López de Vergara, Dictyogram, slide 4
5. Context
Network flow-based monitoring has proven useful to detect network intrusions, malfunctions, and other types of anomalies.
Unfortunately, network managers have to deal with tons of measurement data, and its interpretation has become a challenge.
Data summaries: it is difficult to reach a good trade-off between detail and simplification; insufficient data can lead to restricted or even erroneous conclusions.
Not only the measurements are important from the point of view of network management: the application of suitable techniques improves the quality and depth of the knowledge that can be extracted from them.
6. Our Goals
Our proposal is intended to ease network managers' work by proposing a novel approach to study the behavior of network flow characteristics. Our main goal is to define comprehensive summaries of network flow data:
Our approach is based on the study of the ECDFs of different flow characteristics (e.g., flow size or duration distributions).
Using those ECDFs, we define flow categories via the probability integral transform (e.g., using decile-delimited intervals).
As we will see, this approach improves the detection of network anomalies and the visualization of the network state.
7. Method description
Probability integral transform:
Let X be a continuous random variable with cumulative distribution function F_X. Then F_X(X) follows a uniform distribution on [0, 1].
[Figure: (a) the CDF F_X with probability levels P_i and the resulting category bounds C_i = F_X^{-1}(P_i); (b) the resulting uniform distribution on [0, 1].]

Then, we define flow categories from the CDF of a given flow characteristic using a set of probability levels.
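The category-definition step just described (empirical deciles as the bounds C_i = F_X^{-1}(P_i)) can be sketched in Python. The lognormal flow-size sample below is synthetic; only the decile mechanism follows the method.

```python
# Sketch of defining flow categories from decile thresholds C_i = F_X^{-1}(P_i)
# estimated from an empirical CDF. The flow-size sample here is synthetic.
import random

random.seed(0)
flow_sizes = [random.lognormvariate(5, 2) for _ in range(10_000)]

sorted_sizes = sorted(flow_sizes)
n = len(sorted_sizes)
# Empirical decile thresholds for P_i = 0.1, ..., 0.9 (order statistics).
deciles = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
thresholds = [sorted_sizes[int(p * n)] for p in deciles]

def category(size):
    """Map a flow size to a category 1..10 by its decile interval."""
    for i, c in enumerate(thresholds):
        if size <= c:
            return i + 1
    return 10
```

With such thresholds fixed from a reference period, each observed flow falls into one of ten equally likely categories, which is what makes the normalized Dictyogram bands comparable over time.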
8. Keep an eye on the hypotheses!
[Figure: sample CDFs and their transformed values for (c) a Gaussian distribution and (d) a Poisson distribution.]
9. How can we model a CDF?
Glivenko-Cantelli theorem: the ECDF converges to the CDF
as the number of observations increases.
Nonetheless, computational cost increases when we
accumulate all the values of the characteristic under analysis.
Alternative approach: Functional Data Analysis:
Mean function: F_X^mean = (1/n) · Σ_{i=1}^{n} F_{X_i}
Problem: not robust.
Functional depth:
Maximum depth observation.
Median function (the function that maximizes the functional depth we use).
Problem: more computationally expensive.
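A pointwise sketch of the mean-function estimator on a common evaluation grid follows. The per-day samples are synthetic; only the estimator F_X^mean = (1/n) · Σ F_{X_i} mirrors the slide.

```python
# Sketch of the "mean function" model: average several empirical CDFs
# pointwise on a common evaluation grid. The daily samples are synthetic.
import random

random.seed(1)
days = [[random.expovariate(0.01) for _ in range(500)] for _ in range(7)]

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the point x."""
    return sum(v <= x for v in sample) / len(sample)

grid = [10 * k for k in range(1, 51)]   # evaluation points
mean_cdf = [sum(ecdf(day, x) for day in days) / len(days) for x in grid]
```

As the slide warns, this estimator is not robust: a single atypical day pulls the whole averaged curve, which is what motivates the depth-based alternatives despite their extra computational cost.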
10. Dataset for the evaluation
To assess the advantages of our method, we have used a real dataset:
Flow records, Spanish Academic Network: more than one
million users, more than 7 years of data.
Exporters: 5 Netflow exporters, different geographical
locations (all of them in Spain).
Packet level sampling: rate of one out of 100 packets.
Period selected for the evaluation of the CDF estimation
methods: 30 days.
11. Analyzing ECDFs to get a model of the typical behavior
[Figure axes: flow size (bytes, log scale) vs. P(X>x). Marked decile points: (40, 0.9), (44, 0.8), (53, 0.7), (80, 0.6), (149, 0.5), (501, 0.4), (1452, 0.3), (1500, 0.2), (3000, 0.1).]
Figure: Comparison between observed CCDFs (orange line, no marker)
for Exporter A, and models obtained using the mean (blue line, circles),
deepest (black line, diamonds) and median (red line, triangles) functions.
12. Empirical comparison (I)
[Figure: five panels (A–E), one per exporter, plotting the test-statistic against the day of the evaluation period (0–30); legend: Mean, Deepest, Median.]
Figure: Evolution of the Pearson’s test-statistic for all exporters. (Less is
better.)
13. Empirical comparison (II)
Table: Summary of the evaluation of the different methods to estimate
the CDF.
Exporter  Method           # Best
A         Mean function    0
A         Deepest obs.     3
A         Median function  25
B         Mean function    0
B         Deepest obs.     6
B         Median function  22
C         Mean function    20
C         Deepest obs.     8
C         Median function  0
D         Mean function    0
D         Deepest obs.     23
D         Median function  5
E         Mean function    0
E         Deepest obs.     28
E         Median function  0
14. Final visualization of Dictyogram
[Figure: three panels of concurrent flows for each category vs. time of day: (a) Mean, (b) Deepest Observation, (c) Median, with two recurring events marked 1 and 2.]
Figure: Dictyogram representation of f_i(t), with the respective size intervals delimited by the deciles given by the (a) mean, (b) deepest observed ECDF, and (c) median.
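The f_i(t) series in the figure above can be sketched as a per-category count of active flows in each time bin. The (start, end, category) tuples and the bin width below are illustrative assumptions, not the authors' data format.

```python
# Sketch of building Dictyogram series f_i(t): for each time bin, count how
# many flows of each category are active. Flows are hypothetical
# (start, end, category) tuples; the bin width is an arbitrary choice.

def dictyogram_series(flows, t0, t1, step, n_categories=10):
    """Return {category: [active-flow count per time bin]}."""
    bins = list(range(t0, t1, step))
    series = {c: [0] * len(bins) for c in range(1, n_categories + 1)}
    for start, end, cat in flows:
        for k, t in enumerate(bins):
            if start < t + step and end > t:   # flow overlaps bin [t, t + step)
                series[cat][k] += 1
    return series

flows = [(0, 30, 1), (10, 50, 1), (20, 25, 2)]
series = dictyogram_series(flows, 0, 60, 10, n_categories=2)
```

Stacking these per-category series over the day produces the banded "electrogram" view shown in the figure.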
15. Final visualization of Dictyogram
[Figure: zoom into the median-based Dictyogram over one day (00:00–23:20), with the two marked events.]
16. Key remarks
Our method:
Is manager-friendly: it provides statistical summaries based on certain probability levels, which eases the study of the flows traversing the network.
Links statistical properties to time evolution: it eases the detection of changes in the statistical properties of the characteristics under analysis.
Improves network flow data visualization: it lets managers control the resolution at which the distribution of network flow characteristics is visualized.
17. Future work
We plan to:
Study how to summarize several different network behaviors in a multivariate uniform distribution, and use other well-known distributions (not only the uniform) for signatures.
Study the distribution of the Pearson's test-statistic to detect anomalous events.
Test the stability of the estimation of the CDF (to define criteria to recalibrate the model).
Explore other representations with higher dimensionality.
19. Annex: Functional depth
We use the definition given by:
MS_{n,H}(x) = min{ SL_n(x), IL_n(x) }    (1)

where

SL_n(x) = (1 / (n·λ(I))) · Σ_{i=1}^{n} λ{ t ∈ I : x(t) ≤ x_i(t) },
IL_n(x) = (1 / (n·λ(I))) · Σ_{i=1}^{n} λ{ t ∈ I : x(t) ≥ x_i(t) }    (2)
With it, we consider:
Maximum depth observation.
Median Function (it is the function that maximizes the
functional depth we use).
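For curves sampled on a common grid, the depth in equations (1)-(2) reduces to averaging indicator fractions over the grid points. A minimal sketch (not the authors' implementation; the example curves are made up):

```python
# Sketch of the band-type functional depth from equations (1)-(2), for curves
# sampled on a common grid: SL_n averages the fraction of the interval where
# each sample curve lies above x, IL_n where it lies below, and the depth
# MS_{n,H}(x) is the minimum of the two.

def functional_depth(x, curves):
    """MS_{n,H}(x) = min(SL_n(x), IL_n(x)) for grid-sampled curves."""
    n, m = len(curves), len(x)
    sl = sum(sum(x[t] <= c[t] for t in range(m)) / m for c in curves) / n
    il = sum(sum(x[t] >= c[t] for t in range(m)) / m for c in curves) / n
    return min(sl, il)

def deepest(curves):
    """Maximum-depth observation: the sample curve of highest depth."""
    return max(curves, key=lambda c: functional_depth(c, curves))

# Three constant curves: the middle one is the deepest observation.
curves = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

The median function of the slides is the curve maximizing this depth, which is why it is more robust than the mean function but costlier to compute.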