
KSII The first International Conference on Internet (ICONI) 2009, December 2009
Copyright ⓒ 2009 KSII

Classification of Traffic Flows into QoS
Classes by Unsupervised Learning and
KNN Clustering
Yi Zeng (1) and Thomas M. Chen (2)*
(1) San Diego Supercomputer Center, University of California
San Diego, CA 92093 - USA
[e-mail: yzeng@sdsc.edu]
(2) School of Engineering, Swansea University
Swansea, Wales SA2 8PP - UK
[e-mail: t.m.chen@swansea.ac.uk]
*Corresponding author: Thomas M. Chen

Abstract

Traffic classification seeks to assign packet flows to an appropriate quality of service (QoS) class
based on flow statistics without the need to examine packet payloads. Classification proceeds in two
steps. Classification rules are first built by analyzing traffic traces, and then the classification rules are
evaluated using test data. In this paper, we use the self-organizing map (SOM) and K-means clustering as
unsupervised machine learning methods to identify the inherent classes in traffic traces. Three clusters
were discovered, corresponding to transactional, bulk data transfer, and interactive applications. The
K-nearest neighbor classifier was found to be highly accurate for the traffic data and significantly
more accurate than a minimum mean distance classifier.

Keywords: Traffic classification, unsupervised learning, k-nearest neighbor, clustering

1. Introduction

Network operators and system administrators are interested in the mixture of traffic carried in
their networks for several reasons. Knowledge about traffic composition is valuable for network
planning, accounting, security, and traffic control. Traffic control includes packet scheduling and
intelligent buffer management to provide the quality of service (QoS) needed by applications. It is
necessary to determine to which applications packets belong, but traditional protocol layering
principles restrict the network to processing only the IP packet header.

In Section 2, we review the previous work in traffic classification. Section 3 addresses the
question of useful features and number of QoS classes. We describe experiments with unsupervised
clustering of real traffic traces to build classification rules. Given the discovered QoS classes,
Section 4 presents experimental evaluation of classification accuracy using the k-nearest neighbor
classifier compared to the minimum mean distance classifier.

This research was supported by a research grant from the IT R&D program of MKE/IITA, the Korean
government [2005-Y-001-04, Development of Next Generation Security Technology]. We express our thanks to
Dr. Richard Berke who checked our manuscript.

2 Zeng et al.: Classification of Traffic Flows into QoS Classes by Clustering

2. Related Work

Research in traffic classification that avoids payload inspection has accelerated over the last
five years. It is generally difficult to compare different approaches, because they vary in the
selection of features (some requiring inspection of the packet payload), choice of supervised or
unsupervised classification algorithms, and set of classified traffic classes. The wide range of
previous approaches can be seen in the comprehensive survey by Nguyen and Armitage [1]. Further
complicating comparisons between different studies is the fact that classification performance
depends on how the classifier is trained and the test data used to evaluate accuracy.
Unfortunately, a universal set of test traffic data does not exist to allow uniform comparisons of
different classifiers.

A common approach is to classify traffic on the basis of flows instead of individual packets.
Trussell et al. proposed the distribution of packet lengths as a useful feature [2]. McGregor et al.
used a variety of features: packet length statistics, interarrival times, byte counts, and
connection duration [3]. Flows with similar features were grouped together using EM
(expectation-maximization) clustering. Having found the clusters representing a set of traffic
classes, the features contributing little were deleted to simplify classification and the clusters
were recomputed with the reduced feature set. EM clustering was also studied by Zander, Nguyen, and
Armitage [4]. Sequential forward selection (SFS) was used to reduce the feature set. The same
authors also tried AutoClass, an unsupervised Bayesian classifier, for cluster formation and SFS for
feature set reduction [5].

3. Unsupervised Clustering

3.1 Self-Organizing Map

The SOM is trained iteratively. In each training step, one sample vector x from the input data pool
is chosen randomly, and the distances between it and all the SOM codebook vectors are calculated
using some distance measure. The neuron whose codebook vector is closest to the input vector is
called the best-matching unit (BMU), denoted by $m_c$:

$$\|x - m_c\| = \min_i \{\|x - m_i\|\} \qquad (1)$$

where $\|\cdot\|$ is the Euclidean distance, and $\{m_i\}$ are the codebook vectors.

After finding the BMU, the SOM codebook vectors are updated, such that the BMU is moved closer to
the input vector. The topological neighbors of the BMU are also treated this way. This procedure
moves the BMU and its topological neighbors towards the sample vectors. The update rule for the ith
codebook vector is:

$$m_i(n+1) = m_i(n) + \alpha_r(n)\, h_{ci}(n)\, [x(n) - m_i(n)] \qquad (2)$$

where n is the training iteration number, $x(n)$ is an input vector randomly selected from the input
data set at the nth training, $\alpha_r(n)$ is the learning rate in the nth training, and
$h_{ci}(n)$ is the kernel function around the BMU $m_c$. The kernel function defines the region of
influence that x has on the map.

Fig. 1 shows the U-matrix and the component planes for the feature variables. The U-matrix is a
visualization of distance between neurons, where distance is color coded according to the spectrum
shown next to the map. Blue areas represent codebook vectors close to each other in input space,
i.e., clusters.

Fig. 1. U-matrix with 7 components scaled to [0,1].

3.2 K-Means Clustering

The K-means clustering algorithm starts with a training data set and a given number of clusters K.
The samples in the training data set are assigned to a cluster based on a similarity
measurement. Euclidean distance is generally used to measure the similarity. The K-means algorithm
tries to find an optimal solution by minimizing the square error:

$$E_r = \sum_{i=1}^{K} \sum_{j=1}^{n} \|x_j - c_i\|^2 \qquad (3)$$

where K is the number of clusters and n is the number of training samples, $c_i$ is the center of
the ith cluster, and $\|x_j - c_i\|$ is the Euclidean distance between sample $x_j$ and center
$c_i$ of the ith cluster.

4. Experimental Classification Results and Analysis

The previous section identified three clusters for QoS classes and features to build up
classification rules through unsupervised learning. In this section, the accuracy of the
classification rules is evaluated experimentally. For classification, we chose the K-nearest
neighbor (KNN) algorithm. Experimental results are compared with the minimum mean distance (MMD)
classifier.

The selected application lists for each class and the number of applications in each class are
shown in Table 1.

Table 1. Applications in each class

Class           Applications                                           Total number
Transactional   53/TCP, 13/TCP, 111/TCP,…                              112
Interactive     23/TCP, 21/TCP, 43/TCP, 513/TCP, 514/TCP, 540/TCP,     77
                251/TCP, 1017/TCP, 1019/TCP, 1020/TCP, 1022/TCP,…
Bulk data       80/TCP, 20/TCP, 25/TCP, 70/TCP, 79/TCP, 81/TCP,        1351
                82/TCP, 83/TCP, 84/TCP, 119/TCP, 210/TCP, 8080/TCP,…

5. Conclusions

Traffic classification was carried out in two phases. In the first off-line phase, we started with
no assumptions about traffic classes and used the unsupervised SOM and K-means clustering
algorithms to find the structure in the traffic data. The data exploration procedure found three
clusters corresponding to three QoS classes: transactional, interactive, and bulk data transfer.

In the second classification phase, the accuracy of the KNN classifier was evaluated for test data.
Leave-one-out cross-validation tests showed that this algorithm had a low error rate. The KNN
classifier was found to have an error rate of about 2 percent for the test data, compared to an
error rate of 7 percent for an MMD classifier. KNN is one of the simplest classification
algorithms, but not necessarily the most accurate. Other supervised algorithms, such as back
propagation (BP) and SVM, also have attractive features and should be compared in future work.

References

[1] Thuy Nguyen and Grenville Armitage, “A survey of techniques for Internet traffic classification
    using machine learning,” IEEE Communications Surveys and Tutorials, vol.10, no.4, pp.56-76, 2008.
[2] H. Trussell, A. Nilsson, P. Patel, and Y. Wang, “Estimation and detection of network traffic,”
    in Proc. of 11th Digital Signal Processing Workshop, pp.246-248, 2004.
[3] Anthony McGregor, Mark Hall, Perry Lorier, and James Brunskill, “Flow clustering using machine
    learning techniques,” in Proc. of 5th Int. Workshop on Passive and Active Network Measurement,
    pp.205-214, 2004.
[4] Sebastian Zander, Thuy Nguyen, and Grenville Armitage, “Self-learning IP traffic classification
    based on statistical flow characteristics,” in Proc. of 6th Int. Workshop on Passive and Active
    Measurement, pp.325-328, 2005.
[5] Sebastian Zander, Thuy Nguyen, and Grenville Armitage, “Automated traffic classification and
    application identification using machine learning,” in Proc. of IEEE Conf. on Local Computer
    Networks, pp.250-257, 2005.
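
As an illustration of the SOM training step described in Section 3.1, the following minimal sketch finds the best-matching unit by Euclidean distance and then moves the BMU and its map neighbors toward the sample. It is not the code used in the experiments. It assumes the standard Kohonen form of the update, m_i <- m_i + alpha * h_ci * (x - m_i), a Gaussian kernel over map-grid distance, and linearly decaying learning rate and kernel width. The function names and default parameters are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def find_bmu(x, codebook):
    """Index c of the codebook vector closest to x (best-matching unit)."""
    return min(range(len(codebook)), key=lambda i: euclidean(x, codebook[i]))


def gaussian_kernel(grid_c, grid_i, sigma):
    """Assumed kernel h_ci: Gaussian in map-grid distance from the BMU."""
    d2 = sum((p - q) ** 2 for p, q in zip(grid_c, grid_i))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def som_step(x, codebook, grid, alpha, sigma):
    """One training iteration: find the BMU, then update every codebook
    vector in place, weighted by its kernel value relative to the BMU."""
    c = find_bmu(x, codebook)
    for i, m in enumerate(codebook):
        h = gaussian_kernel(grid[c], grid[i], sigma)
        codebook[i] = [mk + alpha * h * (xk - mk) for mk, xk in zip(m, x)]
    return c


def train_som(data, rows, cols, iterations=500, alpha0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map; decay schedules here are assumptions."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    codebook = [[rng.random() for _ in range(dim)] for _ in grid]
    for n in range(iterations):
        frac = 1.0 - n / iterations          # linear decay from 1 toward 0
        x = rng.choice(data)                 # random sample from the data pool
        som_step(x, codebook, grid, alpha0 * frac, max(sigma0 * frac, 0.1))
    return codebook, grid
```

With feature vectors scaled to [0,1], as in Fig. 1, a call such as `train_som(flows, rows=10, cols=10)` would return the trained codebook, from which a U-matrix could be computed as the distances between neighboring codebook vectors.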
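
The K-means procedure of Section 3.2 can be sketched in the same spirit. The code below is a minimal illustration using Lloyd's iteration (alternating nearest-center assignment and center recomputation), which is one common way to reduce the square error and is assumed here rather than taken from the paper. The square error is computed in its usual within-cluster form, with each sample measured against the center of the cluster it is assigned to. All names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance, used as the similarity measurement."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def assign(data, centers):
    """Label each sample with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: dist(x, centers[i]))
            for x in data]


def kmeans(data, k, iterations=50, seed=0):
    """Lloyd's iteration starting from k randomly chosen samples."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]
    for _ in range(iterations):
        labels = assign(data, centers)
        new_centers = []
        for i in range(k):
            members = [x for x, lab in zip(data, labels) if lab == i]
            # Keep the previous center if a cluster happens to be empty.
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return centers, assign(data, centers)


def square_error(data, centers, labels):
    """Sum of squared distances from each sample to its assigned center."""
    return sum(dist(x, centers[lab]) ** 2 for x, lab in zip(data, labels))
```

Running `kmeans(flows, k=3)` on scaled flow features would give three centers and a cluster label per flow; comparing `square_error` across several values of K or several seeds is a simple way to check the stability of the clusters.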
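
The comparison in Section 4 between the KNN and MMD classifiers under leave-one-out cross-validation can be illustrated with the following minimal sketch. It makes two assumptions that the text does not spell out: the MMD classifier is interpreted as assigning a sample to the class whose mean vector is nearest, and KNN uses a simple majority vote with a hypothetical default of k = 3. It is meant only to show the evaluation procedure, not to reproduce the reported error rates.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def knn_predict(x, train_x, train_y, k=3):
    """Majority class among the k training samples nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(x, train_x[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def mmd_predict(x, train_x, train_y):
    """Assumed MMD rule: class whose mean vector is closest to x."""
    means = {c: mean([p for p, y in zip(train_x, train_y) if y == c])
             for c in set(train_y)}
    return min(means, key=lambda c: dist(x, means[c]))


def loo_error(xs, ys, predict):
    """Leave-one-out error rate: classify each sample using all the others."""
    errors = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if predict(xs[i], rest_x, rest_y) != ys[i]:
            errors += 1
    return errors / len(xs)
```

Given labeled flow features `xs` and class labels `ys`, the two error rates would be obtained as `loo_error(xs, ys, knn_predict)` and `loo_error(xs, ys, mmd_predict)`.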
