SlideShare a Scribd company logo
1 of 16
Download to read offline
Discriminators for use in flow-based
classification
ISSN 1470-5559
RR-05-13 updated
August 2005
Department of Computer Science
Andrew Moore, Denis Zuev and Michael Crogan
Discriminators for use in flow-based classification∗
Andrew W. Moore,
Queen Mary, University of London, Department of Computer Science, †
Denis Zuev,
University of Oxford, Mathematical Institute,‡
and
Michael L. Crogan§
August 2004, updated August 2005
Abstract
Any assessment of classification techniques requires data. This document describes sets of data
intended to aid in the assessment of classification work. A number of data sets are described; each data
set consists a number of objects, and each object is described by a group of features (also referred to
as discriminators). Leveraged by a quantity of hand-classified data, each object within each data set
represents a single flow of TCP packets between client and server. The features for each object consist
of the (application-centric) classification derived elsewhere and a number of features derived as input to
probabilistic classification techniques. In addition to describing the features, we also provide information
allowing interested parties to retrieve these data sets for use in their own work. The data sets contain no
site-identifying information; each object is only described by a set of statistics and a class that defines
the causal application.
1 Introduction
This work describes sets of data that we have provided to the research community in order to allow them to
assess their classification techniques on real data. The intention of this work is to provide the provenance of
the discriminators that describe each object as well as to provide background on the formation of the objects
and subsequent limitations of this data set.
We created these data sets as training sets to allow assessment of probabilistic classification techniques.
The information in the features is derived using packet header information alone, while the classification-
class has been derived using a content-based analysis. The content-based classification process is described
in [1, 2].
We have used data collected by the high-performance network monitor described in [3]. We use its loss-
limited, capture to disk providing timestamps with resolution of better than 35 nanoseconds. The site, B,
is a research facility host to about 1,000 users connected to the Internet via a full-duplex Gigabit Ethernet
link. Our data is based upon a 24-hour, full-duplex trace of this research facility.
Section 2 describes how we subsample the 24-hour period, creating 10 separate data sets each from a
different period of the 24-hour day. Section 3 describes how we filter the data of each period to create the
sets of objects (TCP flows) we subsequently characterize. Section 4 describes each of the discriminators
(features) that are used to characterize the objects of each data set. Section 5 notes the address from which
this data may be retrieved, summarizes the features and limitations of these data sets and notes where future
work may take us.
∗Contact author: andrew.moore@dcs.qmul.ac.uk
†This work was completed when Andrew Moore was supported by a research fellowship from the the Intel Corporation.
‡This work was completed when Denis Zuev was employed by Intel Research, Cambridge.
§This work was completed when Michael Crogan was employed by Intel Research, Cambridge.
1
15
45
75
30
60
90
00:00 12:00 18:00
Mbps
Time (GMT)06:00
01 02 03 04 05 06 07 08 09 10 Block
Number
Figure 1: Heuristic illustration of how data blocks were obtained. The line represents the instantaneous
bandwidth requirements during the day, while the dark regions represent the data of each data set.
Data-set Start-time End-time Duration Flows (Objects)
entry01 2003-Aug-20 00:34:21 2003-Aug-20 01:04:43 1821.8 24863
entry02 2003-Aug-20 01:37:37 2003-Aug-20 02:05:54 1696.7 23801
entry03 2003-Aug-20 02:45:19 2003-Aug-20 03:14:03 1724.1 22932
entry04 2003-Aug-20 04:03:31 2003-Aug-20 04:33:15 1784.1 22285
entry05 2003-Aug-20 04:39:10 2003-Aug-20 05:09:05 1794.9 21648
entry06 2003-Aug-20 06:07:28 2003-Aug-20 06:35:06 1658.5 19384
entry07 2003-Aug-20 09:42:17 2003-Aug-20 10:11:16 1739.2 55835
entry08 2003-Aug-20 11:52:40 2003-Aug-20 12:20:26 1665.9 55494
entry09 2003-Aug-20 13:45:37 2003-Aug-20 14:13:21 1664.5 66248
entry10 2003-Aug-20 14:55:44 2003-Aug-20 15:22:37 1613.4 65036
Table 1: Broad statistics of each data set.
2 Data sets
In order to construct the sets of flows, the day trace was split into ten blocks of approximately 1680 seconds
(28 minutes) each. In order to provide a wider sample of mixing across the day, the start of each sample
was selected randomly (uniformly distributed over the whole day trace). Figure 1 illustrates heuristically
our technique. It can be seen from Table 1 that there are a different number of flows in each data block, due
to a variable density of traffic during each constant period. Since time statistics of flows are present in the
analysis, we consider it to be important to keep a fixed time window when selecting flows.
Each data set represents a period of time taken from within the day. Table 1 provides a list of the data
sets along with information about their duration and the number of flows present in each data set. While
each of the data sets represent approximately the same period of time, the number of objects per data set
fluctuates as a result of the variation in activity throughout the course of the day.
While the start times of the ten data sets were selected from a random uniform distribution, there is a clear
bias toward the first 16 hours of the day. As noted in Future work the discriminator-based characterization
is planned for all flows over the whole 24-hour period thereby removing this bias from the data.
2
3 Flow Definition
Each data set is represented as a text file, and consists of multiple lines. Each line represents an object (flow).
In the Internet a flow may be defined as one or more packets traveling between two computer addresses using
a particular protocol (e.g., TCP, UDP, ICMP)—and where appropriate a particular pair of ports (defined
for each end of the flow). This tuple of information (hostsrc, hostdest, portsrc, portdest, protocol) is present
in every packet.
Source and destination elements in the tuple will be reversed for packets traveling in the opposite direction.
In this way a stream of packets may be either half- or full-duplex.
In order to simplify our definitions, we elected to concentrate upon the TCP protocol and TCP flows
only, UDP data is planned in future work. TCP is a stateful protocol and its flows have a well-defined
beginning and end, and subsequently do not lend themselves to the unclear start-end definitions that may
plague a flow of data consisting of individual datagrams alone (such as UDP).
In the simplest case, a complete flow is well defined when both a complete flow setup and tear-down are
observed. The complexity in any flow definition occurs when the setup is incomplete or the tear-down is
abnormal. A set of rules is embodied into netdude [4] that implements the TCP state engine in a sufficiently
robust fashion even in the face of packet loss (as might occur in a network monitoring system).
We use netdude to create a collection of complete TCP flows.
Complete TCP flows have a number of desirable properties. For example, using complete flows allows
us to differentiate client from server, and this allows us to reliably identify the client and server ports.
Additionally, complete TCP flows allow us to compute nearly all the discriminators provided in each data
set.
4 Discriminators
This data set is intended to provide a wide variety of features to characterise flows. This includes simple
statistics about packet length and inter-packet timings, and information derived from the transport protocol
(TCP): such as SYN and ACK counts. This information is provided based on all packets (both directions)
and on each direction individually (server → client and client → server).
Many packet statistics are derived directly by counting packets, and packet header-sizes. A significant
number of features (such as estimates of round-trip time, size of TCP segments, and the total number of
retransmissions) are derived from the TCP headers — we use tcptrace[5] for this information.
We describe each flow using three modes:
idle : no packets between client and server for greater-than or equal-to two seconds,
interactive : data packets moving in both directions, and
bulk : data packets in one direction and only acknowledgments in the other.
We provide features of the flow duration, the time a flow spends in the bulk mode, and the time spent
by the flow in the idle mode.
The effective bandwidth utilisation is a computation of entropy that provides an insight into the activity
of a flow. Such an entropy measure may be computed over a number of different time-scales as a result the
values provided are single points in a continuum.
Finally, from the inter-arrival time of packets (in both directions and each direction individually) we also
provide a ranked list of the ten frequency components that contribute the most to the Fourier transform and
the effective bandwidth utilisation 1
1It may be argued that frequency components based upon binning would be more useful than a ranked list — we plan to
add that in the future.
3
Table 2: Discriminators and Definitions
Number Short Long
1 Server Port Port Number at server; we can establish server and
client ports as we limit ourselves to flows for which we
see the initial connection set-up.
2 Client Port Port Number at client
3 min IAT Minimum packet inter-arrival time for all packets of
the flow (considering both directions).
4 q1 IAT First quartile inter-arrival time
5 med IAT Median inter-arrival time
6 mean IAT Mean inter-arrival time
7 q3 IAT Third quartile packet inter-arrival time
8 max IAT Maximum packet inter-arrival time
9 var IAT Variance in packet inter-arrival time
10 min data wire Minimum of bytes in (Ethernet) packet, using the size
of the packet on the wire.
11 q1 data wire First quartile of bytes in (Ethernet) packet
12 med data wire Median of bytes in (Ethernet) packet
13 mean data wire Mean of bytes in (Ethernet) packet
14 q3 data wire Third quartile of bytes in (Ethernet) packet
15 max data wire Maximum of bytes in (Ethernet) packet
16 var data wire Variance of bytes in (Ethernet) packet
17 min data ip Minimum of total bytes in IP packet, using the size of
payload declared by the IP packet
18 q1 data ip First quartile of total bytes in IP packet
19 med data ip Median of total bytes in IP packet
20 mean data ip Mean of total bytes in IP packet
21 q3 data ip Third quartile of total bytes in IP packet
22 max data ip Maximum of total bytes in IP packet
23 var data ip Variance of total bytes in IP packet
24 min data control Minimum of control bytes in packet, size of the
(IP/TCP) packet header
25 q1 data control First quartile of control bytes in packet
26 med data control Median of control bytes in packet
27 mean data control Mean of control bytes in packet
28 q3 data control Third quartile of control bytes in packet
29 max data control Maximum of control bytes in packet
30 var data control Variance of control bytes packet
31 total packets a b The total number of packets seen (client→server).
32 total packets b a ” (server→client)
Continued on next page
4
Number Short Long
33 ack pkts sent a b The total number of ack packets seen (TCP segments
seen with the ACK bit set) (client→server).
34 ack pkts sent b a ” (server→client)
35 pure acks sent a b The total number of ack packets seen that were not
piggy-backed with data (just the TCP header and
no TCP data payload) and did not have any of the
SYN/FIN/RST flags set (client→server)
36 pure acks sent b a ” (server→client)
37 sack pkts sent a b The total number of ack packets seen carrying TCP
SACK [6] blocks (client→server)
38 sack pkts sent b a ” (server→client)
39 dsack pkts sent a b The total number of sack packets seen that carried
duplicate SACK (D-SACK) [7] blocks. (client→server)
40 dsack pkts sent b a ” (server→client)
41 max sack blks/ack a b The maximum number of sack blocks seen in any sack
packet. (client→server)
42 max sack blks/ack b a ” (server→client)
43 unique bytes sent a b The number of unique bytes sent, i.e., the total bytes of
data sent excluding retransmitted bytes and any bytes
sent doing window probing. (client→server)
44 unique bytes sent b a ” (server→client)
45 actual data pkts a b The count of all the packets with at least a byte of
TCP data payload. (client→server)
46 actual data pkts b a ” (server→client)
47 actual data bytes a b The total bytes of data seen. Note that this includes
bytes from retransmissions / window probe packets if
any. (client→server)
48 actual data bytes b a ” (server→client)
49 rexmt data pkts a b The count of all the packets found to be retransmis-
sions. (client→server)
50 rexmt data pkts b a ” (server→client)
51 rexmt data bytes a b The total bytes of data found in the retransmitted
packets. (client→server)
52 rexmt data bytes b a ” (server→client)
53 zwnd probe pkts a b The count of all the window probe packets seen. (Win-
dow probe packets are typically sent by a sender when
the receiver last advertised a zero receive window, to
see if the window has opened up now). (client→server)
54 zwnd probe pkts b a ” (server→client)
55 zwnd probe bytes a b The total bytes of data sent in the window probe pack-
ets. (client→server)
56 zwnd probe bytes b a ” (server→client)
57 outoforder pkts a b The count of all the packets that were seen to arrive
out of order. (client→server)
Continued on next page
5
Number Short Long
58 outoforder pkts b a ” (server→client)
59 pushed data pkts a b The count of all the packets seen with the PUSH bit
set in the TCP header. (client→server)
60 pushed data pkts b a ” (server→client)
61 SYN pkts sent a b The count of all the packets seen with the SYN bits
set in the TCP header respectively (client→server)
62 FIN pkts sent a b The count of all the packets seen with the FIN bits set
in the TCP header respectively (client→server)
63 SYN pkts sent b a The count of all the packets seen with the SYN bits
set in the TCP header respectively (server→client)
64 FIN pkts sent b a The count of all the packets seen with the FIN bits set
in the TCP header respectively (server→client)
65 req 1323 ws a b If the endpoint requested Window Scaling/Time
Stamp options as specified in RFC 1323[8] a ‘Y’ is
printed on the respective field. If the option was not
requested, an ‘N’ is printed. For example, an “N/Y”
in this field means that the window-scaling option was
not specified, while the Time-stamp option was speci-
fied in the SYN segment. (client→server)
66 req 1323 ts a b . . .
67 req 1323 ws b a If the endpoint requested Window Scaling/Time
Stamp options as specified in RFC 1323[8] a ‘Y’ is
printed on the respective field. If the option was not
requested, an ‘N’ is printed. For example, an “N/Y”
in this field means that the window-scaling option was
not specified, while the Time-stamp option was speci-
fied in the SYN segment. (client→server)
68 req 1323 ts b a . . .
69 adv wind scale a b The window scaling factor used. Again, this field is
valid only if the connection was captured fully to in-
clude the SYN packets. Since the connection would
use window scaling if and only if both sides requested
window scaling [8], this field is reset to 0 (even if a
window scale was requested in the SYN packet for this
direction), if the SYN packet in the reverse direction
did not carry the window scale option. (client→server)
70 adv wind scale b a ” (server→client)
71 req sack a b If the end-point sent a SACK permitted option in the
SYN packet opening the connection, a ‘Y’ is printed;
otherwise ‘N’ is printed. (client→server)
72 req sack b a ” (server→client)
73 sacks sent a b The total number of ACK packets seen carrying SACK
information. (client→server)
74 sacks sent b a ” (server→client)
Continued on next page
6
Number Short Long
75 urgent data pkts a b The total number of packets with the URG bit turned
on in the TCP header. (client→server)
76 urgent data pkts b a ” (server→client)
77 urgent data bytes a b The total bytes of urgent data sent. This field is cal-
culated by summing the urgent pointer offset values
found in packets having the URG bit set in the TCP
header. (client→server)
78 urgent data bytes b a ” (server→client)
79 mss requested a b The Maximum Segment Size (MSS) requested as a
TCP option in the SYN packet opening the connec-
tion. (client→server)
80 mss requested b a ” (server→client)
81 max segm size a b The maximum segment size observed during the life-
time of the connection. (client→server)
82 max segm size b a ” (server→client)
83 min segm size a b The minimum segment size observed during the life-
time of the connection. (client→server)
84 min segm size b a ” (server→client)
85 avg segm size a b The average segment size observed during the lifetime
of the connection calculated as the value reported in
the actual data bytes field divided by the actual data
pkts reported. (client→server)
86 avg segm size b a ” (server→client)
87 max win adv a b The maximum window advertisement seen. If the con-
nection is using window scaling (both sides negoti-
ated window scaling during the opening of the con-
nection), this is the maximum window-scaled adver-
tisement seen in the connection. For a connection us-
ing window scaling, both the SYN segments opening
the connection have to be captured in the dumpfile for
this and the following window statistics to be accurate.
(client→server)
88 max win adv b a ” (server→client)
89 min win adv a b The minimum window advertisement seen. This is the
minimum window-scaled advertisement seen if both
sides negotiated window scaling. (client→server)
90 min win adv b a ” (server→client)
91 zero win adv a b The number of times a zero receive window was adver-
tised. (client→server)
92 zero win adv b a ” (server→client)
93 avg win adv a b The average window advertisement seen, calculated
as the sum of all window advertisements divided by
the total number of packets seen. If the connection
endpoints negotiated window scaling, this average is
calculated as the sum of all window-scaled advertise-
ments divided by the number of window-scaled packets
seen. Note that in the window-scaled case, the win-
dow advertisements in the SYN packets are excluded
since the SYN packets themselves cannot have their
window advertisements scaled, as per RFC 1323 [8].
(client→server)
Continued on next page
7
Number Short Long
94 avg win adv b a ” (server→client)
95 initial window-bytes a b The total number of bytes sent in the initial window
i.e., the number of bytes seen in the initial flight of data
before receiving the first ack packet from the other
endpoint. Note that the ack packet from the other
endpoint is the first ack acknowledging some data (the
ACKs part of the 3-way handshake do not count), and
any retransmitted packets in this stage are excluded.
(client→server)
96 initial window-bytes b a ” (server→client)
97 initial window-packets a b The total number of segments (packets) sent in the
initial window as explained above. (client→server)
98 initial window-packets b a ” (server→client)
99 ttl stream length a b The Theoretical Stream Length. This is calculated as
the difference between the sequence numbers of the
SYN and FIN packets, giving the length of the data
stream seen. Note that this calculation is aware of
sequence space wrap-arounds, and is printed only if
the connection was complete (both the SYN and FIN
packets were seen). (client→server)
100 ttl stream length b a ” (server→client)
101 missed data a b The missed data, calculated as the difference be-
tween the ttl stream length and unique bytes sent.
If the connection was not complete, this calculation
is invalid and an “NA” (Not Available) is printed.
(client→server)
102 missed data b a ” (server→client)
103 truncated data a b The truncated data, calculated as the total bytes of
data truncated during packet capture. For example,
with tcpdump, the snaplen option can be set to 64
(with -s option) so that just the headers of the packet
(assuming there are no options) are captured, truncat-
ing most of the packet data. In an Ethernet with max-
imum segment size of 1500 bytes, this would amount
to truncated data of 1500 64 = 1436bytes for a packet.
(client→server)
104 truncated data b a ” (server→client)
105 truncated packets a b The total number of packets truncated as explained
above. (client→server)
106 truncated packets b a ” (server→client)
107 data xmit time a b Total data transmit time, calculated as the differ-
ence between the times of capture of the first and
last packets carrying non-zero TCP data payload.
(client→server)
Continued on next page
8
Number Short Long
108 data xmit time b a ” (server→client)
109 idletime max a b Maximum idle time, calculated as the maximum time
between consecutive packets seen in the direction.
(client→server)
110 idletime max b a ” (server→client)
111 throughput a b The average throughput calculated as the unique bytes
sent divided by the elapsed time i.e., the value reported
in the unique bytes sent field divided by the elapsed
time (the time difference between the capture of the
first and last packets in the direction). (client→server)
112 throughput b a ” (server→client)
113 RTT samples a b The total number of Round-Trip Time (RTT) sam-
ples found. tcptrace is pretty smart about choosing
only valid RTT samples. An RTT sample is found
only if an ack packet is received from the other end-
point for a previously transmitted packet such that
the acknowledgment value is 1 greater than the last
sequence number of the packet. Further, it is required
that the packet being acknowledged was not retrans-
mitted, and that no packets that came before it in
the sequence space were retransmitted after the packet
was transmitted. Note : The former condition invali-
dates RTT samples due to the retransmission ambigu-
ity problem, and the latter condition invalidates RTT
samples since it could be the case that the ack packet
could be cumulatively acknowledging the retransmit-
ted packet, and not necessarily ack-ing the packet in
question. (client→server)
114 RTT samples b a ” (server→client)
115 RTT min a b The minimum RTT sample seen. (client→server)
116 RTT min b a ” (server→client)
117 RTT max a b The maximum RTT sample seen. (client→server)
118 RTT max b a ” (server→client)
119 RTT avg a b The average value of RTT found, calculated
straightforward-ly as the sum of all the RTT values
found divided by the total number of RTT samples.
(client→server)
120 RTT avg b a ” (server→client)
121 RTT stdv a b The standard deviation of the RTT samples.
(client→server)
122 RTT stdv b a ” (server→client)
123 RTT from 3WHS a b The RTT value calculated from the TCP 3-Way
Hand-Shake (connection opening) [9], assuming that
the SYN packets of the connection were captured.
(client→server)
Continued on next page
9
Number Short Long
124 RTT from 3WHS b a ” (server→client)
125 RTT full sz smpls a b The total number of full-size RTT samples, calculated
from the RTT samples of full-size segments. Full-size
segments are defined to be the segments of the largest
size seen in the connection. (client→server)
126 RTT full sz smpls b a ” (server→client)
127 RTT full sz min a b The minimum full-size RTT sample. (client→server)
128 RTT full sz min b a ” (server→client)
129 RTT full sz max a b The maximum full-size RTT sample. (client→server)
130 RTT full sz max b a ” (server→client)
131 RTT full sz avg a b The average full-size RTT sample. (client→server)
132 RTT full sz avg b a ” (server→client)
133 RTT full sz stdev a b The standard deviation of full-size RTT samples.
(client→server)
134 RTT full sz stdev b a ” (server→client)
135 post-loss acks a b The total number of ack packets received after losses
were detected and a retransmission occurred. More
precisely, a post-loss ack is found to occur when an
ack packet acknowledges a packet sent (acknowledg-
ment value in the ack pkt is 1 greater than the packet’s
last sequence number), and at least one packet occur-
ring before the packet acknowledged, was retransmit-
ted later. In other words, the ack packet is received
after we observed a (perceived) loss event and are re-
covering from it. (client→server)
136 post-loss acks b a ” (server→client)
137 segs cum acked a b The count of the number of segments that were cumu-
latively acknowledged and not directly acknowledged.
(client→server)
138 segs cum acked b a ” (server→client)
139 duplicate acks a b The total number of duplicate acknowledgments re-
ceived. (client→server)
140 duplicate acks b a ” (server→client)
141 triple dupacks a b The total number of triple duplicate acknowledgments
received (three duplicate acknowledgments acknowl-
edging the same segment), a condition commonly used
to trigger the fast-retransmit/fast-recovery phase of
TCP. (client→server)
142 triple dupacks b a ” (server→client)
143 max # retrans a b The maximum number of retransmissions seen for
any segment during the lifetime of the connection.
(client→server)
Continued on next page
10
Number Short Long
144 max # retrans b a ” (server→client)
145 min retr time a b The minimum time seen between any two
(re)transmissions of a segment amongst all the
retransmissions seen. (client→server)
146 min retr time b a ” (server→client)
147 max retr time a b The maximum time seen between any two
(re)transmissions of a segment. (client→server)
148 max retr time b a ” (server→client)
149 avg retr time a b The average time seen between any two
(re)transmissions of a segment calculated from
all the retransmissions. (client→server)
150 avg retr time b a ” (server→client)
151 sdv retr time a b The standard deviation of the retransmission-time
samples obtained from all the retransmissions.
(client→server)
152 sdv retr time b a ” (server→client)
153 min data wire a b Minimum number of bytes in (Ethernet) packet
(client→server)
154 q1 data wire a b First quartile of bytes in (Ethernet) packet
155 med data wire a b Median of bytes in (Ethernet) packet
156 mean data wire a b Mean of bytes in (Ethernet) packet
157 q3 data wire a b Third quartile of bytes in (Ethernet) packet
158 max data wire a b Maximum of bytes in (Ethernet) packet
159 var data wire a b Variance of bytes in (Ethernet) packet
160 min data ip a b Minimum number of total bytes in IP packet
161 q1 data ip a b First quartile of total bytes in IP packet
162 med data ip a b Median of total bytes in IP packet
163 mean data ip a b Mean of total bytes in IP packet
164 q3 data ip a b Third quartile of total bytes in IP packet
165 max data ip a b Maximum of total bytes in IP packet
166 var data ip a b Variance of total bytes in IP packet
167 min data control a b Minimum of control bytes in packet
168 q1 data control a b First quartile of control bytes in packet
169 med data control a b Median of control bytes in packet
170 mean data control a b Mean of control bytes in packet
171 q3 data control a b Third quartile of control bytes in packet
172 max data control a b Maximum of control bytes in packet
173 var data control a b Variance of control bytes packet
174 min data wire b a Minimum number of bytes in (Ethernet) packet
(server→client)
175 q1 data wire b a First quartile of bytes in (Ethernet) packet
176 med data wire b a Median of bytes in (Ethernet) packet
177 mean data wire b a Mean of bytes in (Ethernet) packet
Continued on next page
11
Number Short Long
178 q3 data wire b a Third quartile of bytes in (Ethernet) packet
179 max data wire b a Maximum of bytes in (Ethernet) packet
180 var data wire b a Variance of bytes in (Ethernet) packet
181 min data ip b a Minimum number of total bytes in IP packet
182 q1 data ip b a First quartile of total bytes in IP packet
183 med data ip b a Median of total bytes in IP packet
184 mean data ip b a Mean of total bytes in IP packet
185 q3 data ip b a Third quartile of total bytes in IP packet
186 max data ip b a Maximum of total bytes in IP packet
187 var data ip b a Variance of total bytes in IP packet
188 min data control b a Minimum of control bytes in packet
189 q1 data control b a First quartile of control bytes in packet
190 med data control b a Median of control bytes in packet
191 mean data control b a Mean of control bytes in packet
192 q3 data control b a Third quartile of control bytes in packet
193 max data control b a Maximum of control bytes in packet
194 var data control b a Variance of control bytes packet
195 min IAT a b Minimum of packet inter-arrival time (client→server)
196 q1 IAT a b First quartile of packet inter-arrival time
197 med IAT a b Median of packet inter-arrival time
198 mean IAT a b Mean of packet inter-arrival time
199 q3 IAT a b Third quartile of packet inter-arrival time
200 max IAT a b Maximum of packet inter-arrival time
201 var IAT a b Variance of packet inter-arrival time
202 min IAT b a Minimum of packet inter-arrival time (server→client)
203 q1 IAT b a First quartile of packet inter-arrival time
204 med IAT b a Median of packet inter-arrival time
205 mean IAT b a Mean of packet inter-arrival time
206 q3 IAT b a Third quartile of packet inter-arrival time
207 max IAT b a Maximum of packet inter-arrival time
208 var IAT b a Variance of packet inter-arrival time
209 Time since last connection Time since the last connection between these hosts
210 No. transitions bulk/trans The number of transitions between transaction mode
and bulk transfer mode, where bulk transfer mode is
defined as the time when there are more than three
successive packets in the same direction without any
packets carrying data in the other direction
211 Time spent in bulk Amount of time spent in bulk transfer mode
212 Duration Connection duration
213 % bulk Percent of time spent in bulk transfer
214 Time spent idle The time spent idle (where idle time is the accumu-
lation of all periods of 2 seconds or greater when no
packet was seen in either direction)
Continued on next page
12
Number Short Long
215 % idle Percent of time spent idle
216 Effective Bandwidth Effective Bandwidth based upon entropy [10] (both
directions)
217 Effective Bandwidth a b ” (client→server)
218 Effective Bandwidth b a ” (server→client)
219 FFT all FFT of packet IAT (arctan of the top-ten frequencies
ranked by the magnitude of their contribution) (all
traffic) (Frequency #1)
220 FFT all ” (Frequency #2)
221 FFT all ” . . .
222 FFT all ” . . .
223 FFT all ” . . .
224 FFT all ” . . .
225 FFT all ” . . .
226 FFT all ” . . .
227 FFT all ” . . .
228 FFT all ” (Frequency #10)
229 FFT a b FFT of packet IAT (arctan of the top-ten frequen-
cies ranked by the magnitude of their contribution)
(client→server) (Frequency #1)
230 FFT a b ” (Frequency #2)
231 FFT a b ” . . .
232 FFT a b ” . . .
233 FFT a b ” . . .
234 FFT a b ” . . .
235 FFT a b ” . . .
236 FFT a b ” . . .
237 FFT a b ” . . .
238 FFT b a ” (Frequency #10)
239 FFT b a FFT of packet IAT (arctan of the top-ten frequen-
cies ranked by the magnitude of their contribution)
(server→client) (Frequency #1)
240 FFT b a ” (Frequency #2)
241 FFT b a ” . . .
242 FFT b a ” . . .
243 FFT b a ” . . .
244 FFT b a ” . . .
245 FFT b a ” . . .
246 FFT b a ” . . .
247 FFT b a ” . . .
248 FFT b a ” (Frequency #10)
249 Classes Application class, as assigned in [1]
5 Conclusion
The pre-computed discriminator-data sets are available off-of
http://www.dcs.qmul.ac.uk/research/nrl/ and http://www.cl.cam.ac.uk/Research/SRG/netos/
nprobe/data/papers/sigmetrics/index.html. While we make every effort to ensure they are without flaw
— and that these archives are maintained — they are provided on an as-is basis.
Additionally, scripts/code to allow the community to generate discriminators for their own data are
available from the above web-sites.
13
Future Work
It is the intention of the authors to provide sets of discriminators and input data for other traffic as it
becomes available.
Thanks
We thank Matt Roughan for some illuminating conversations pertaining to this work and we are indebted
to a number of tool makers notably, everyone who made tcpdump/libpcap the backbone to our research it
is today, Christian Kreibich for his netdude tools and Shawn Ostermann for the tcptrace toolkit.
References
[1] A. Moore and Konstantina Papagiannaki. Toward the accurate identification of network applications,
2005. http://www.cl.cam.ac.uk/ awm22/publication/moore2005toward.pdf.
[2] A. W. Moore. Discrete content-based classification — a data set. Technical report, Intel Research,
Cambridge, 2005.
[3] Andrew Moore, James Hall, Christian Kreibich, Euan Harris, and Ian Pratt. Architecture of a Network
Monitor. In Passive & Active Measurement Workshop 2003 (PAM2003), April 2003.
[4] Sourceforge. netdude, 2005. http://netdude.sourceforge.net.
[5] Shawn Ostermann. tcptrace, 2003. http://www.tcptrace.org.
[6] M. Mathis and J. Mahdavi and S. Floyd and A. Romanow. TCP Selective Acknowledgement Options,
October 1996. M.MATHIS, J.MAHDAVI, S.FLOYD, AND A.ROMANOW. TCP Selective Acknowl-
edgement Options, October 1996.
[7] S. Floyd and J. Mahdavi and M. Mathis and M. Podolsky. An Extension to the Selective Acknowledge-
ment (SACK) Option for TCP, July 2000. S.FLOYD, J.MAHDAVI, M.MATHIS, AND M.PODOLSKY.
An Extension to the Selective Acknowledgement (SACK) Option for TCP, July 2000.
[8] T. Berners-Lee and R. Fielding and H. Frystyk. Hypertext Transfer Protocol – HTTP/1.0, May 1996.
T.BERNERS-LEE, R.FIELDING, AND H.FRYSTYK. Hypertext Transfer Protocol - HTTP/1.0, May
1996.
[9] J. Postel. Transmission Control Protocol, September 1981. INFORMATION SCIENCES INSTITUTE,
U. O. S. C. Transmission Control Protocol, September 1981.
[10] N. G. Duffield, J. T. Lewis, N. O’Connell, R. Russell, and F. Toomey. Entropy of ATM traffic streams.
IEEE Journal on Selected Areas in Communications, 13(6):981–990, August 1995.
14

More Related Content

What's hot

Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networksSyaiful Ahdan
 
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...ijwmn
 
computer Netwoks - network layer
computer Netwoks - network layercomputer Netwoks - network layer
computer Netwoks - network layerSendhil Kumar
 
Part 10 : Routing in IP networks and interdomain routing with BGP
Part 10 : Routing in IP networks and interdomain routing with BGPPart 10 : Routing in IP networks and interdomain routing with BGP
Part 10 : Routing in IP networks and interdomain routing with BGPOlivier Bonaventure
 
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...IRJET Journal
 
Part 2 : reliable transmission and building a network
Part 2 : reliable transmission and building a networkPart 2 : reliable transmission and building a network
Part 2 : reliable transmission and building a networkOlivier Bonaventure
 
Enhancing the NS-2 Traffic Generator for the MANETs
Enhancing the NS-2 Traffic Generator for the MANETsEnhancing the NS-2 Traffic Generator for the MANETs
Enhancing the NS-2 Traffic Generator for the MANETsIOSR Journals
 
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidanceFault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidancegraphhoc
 
Layer3protocols
Layer3protocolsLayer3protocols
Layer3protocolsassinha
 
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile Sink
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile SinkA Survey on Routing Protocols in Wireless Sensor Network Using Mobile Sink
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile SinkIJEEE
 
E NERGY - E FFICIENT P ATH C ONFIGURATION M ETHOD FOR DEF IN WSN S
E NERGY - E FFICIENT  P ATH  C ONFIGURATION  M ETHOD FOR  DEF IN  WSN SE NERGY - E FFICIENT  P ATH  C ONFIGURATION  M ETHOD FOR  DEF IN  WSN S
E NERGY - E FFICIENT P ATH C ONFIGURATION M ETHOD FOR DEF IN WSN Sijitjournal
 
Lte quick reference ------159
Lte quick reference ------159Lte quick reference ------159
Lte quick reference ------159中琳 李中琳
 
cis97003
cis97003cis97003
cis97003perfj
 
On minimizing data forwarding schedule in multi transmit receive wireless mes...
On minimizing data forwarding schedule in multi transmit receive wireless mes...On minimizing data forwarding schedule in multi transmit receive wireless mes...
On minimizing data forwarding schedule in multi transmit receive wireless mes...redpel dot com
 
Lecture 5
Lecture 5Lecture 5
Lecture 5ntpc08
 

What's hot (20)

Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networks
 
Network Layer
Network LayerNetwork Layer
Network Layer
 
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...
SIMULATION PROCESS FOR MOBILE NODES INFORMATION USING LOCATION-AIDED ROUTING ...
 
Final report
Final reportFinal report
Final report
 
computer Netwoks - network layer
computer Netwoks - network layercomputer Netwoks - network layer
computer Netwoks - network layer
 
Part 10 : Routing in IP networks and interdomain routing with BGP
Part 10 : Routing in IP networks and interdomain routing with BGPPart 10 : Routing in IP networks and interdomain routing with BGP
Part 10 : Routing in IP networks and interdomain routing with BGP
 
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...
IRJET- Assessment of Network Protocol Packet Analysis in IPV4 and IPV6 on Loc...
 
Part 2 : reliable transmission and building a network
Part 2 : reliable transmission and building a networkPart 2 : reliable transmission and building a network
Part 2 : reliable transmission and building a network
 
Enhancing the NS-2 Traffic Generator for the MANETs
Enhancing the NS-2 Traffic Generator for the MANETsEnhancing the NS-2 Traffic Generator for the MANETs
Enhancing the NS-2 Traffic Generator for the MANETs
 
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidanceFault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
 
Layer3protocols
Layer3protocolsLayer3protocols
Layer3protocols
 
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile Sink
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile SinkA Survey on Routing Protocols in Wireless Sensor Network Using Mobile Sink
A Survey on Routing Protocols in Wireless Sensor Network Using Mobile Sink
 
Layer 3
Layer 3Layer 3
Layer 3
 
E NERGY - E FFICIENT P ATH C ONFIGURATION M ETHOD FOR DEF IN WSN S
E NERGY - E FFICIENT  P ATH  C ONFIGURATION  M ETHOD FOR  DEF IN  WSN SE NERGY - E FFICIENT  P ATH  C ONFIGURATION  M ETHOD FOR  DEF IN  WSN S
E NERGY - E FFICIENT P ATH C ONFIGURATION M ETHOD FOR DEF IN WSN S
 
Ospf
OspfOspf
Ospf
 
data transmission
data transmission data transmission
data transmission
 
Lte quick reference ------159
Lte quick reference ------159Lte quick reference ------159
Lte quick reference ------159
 
cis97003
cis97003cis97003
cis97003
 
On minimizing data forwarding schedule in multi transmit receive wireless mes...
On minimizing data forwarding schedule in multi transmit receive wireless mes...On minimizing data forwarding schedule in multi transmit receive wireless mes...
On minimizing data forwarding schedule in multi transmit receive wireless mes...
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 

Viewers also liked

Abdelrahman Saed Ibrahim - CV-20112016
Abdelrahman Saed Ibrahim - CV-20112016Abdelrahman Saed Ibrahim - CV-20112016
Abdelrahman Saed Ibrahim - CV-20112016Abdelrahman Ibrahim
 
Advanced Java Programming
Advanced Java ProgrammingAdvanced Java Programming
Advanced Java ProgrammingAmbo University
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,agasg agas
 
Poder ciudadano y electoral
Poder ciudadano y electoralPoder ciudadano y electoral
Poder ciudadano y electoralyohana angulo
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,agasg agas
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,agasg agas
 
Chatbots DDD North2016
Chatbots DDD North2016Chatbots DDD North2016
Chatbots DDD North2016Galiya Warrier
 
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANK
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANKAn OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANK
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANKVARUN KESAVAN
 
Building a bot with an intent
Building a bot with an intentBuilding a bot with an intent
Building a bot with an intentAbhishek Sur
 

Viewers also liked (10)

Abdelrahman Saed Ibrahim - CV-20112016
Abdelrahman Saed Ibrahim - CV-20112016Abdelrahman Saed Ibrahim - CV-20112016
Abdelrahman Saed Ibrahim - CV-20112016
 
Advanced Java Programming
Advanced Java ProgrammingAdvanced Java Programming
Advanced Java Programming
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 직거래,파퍼 후불판매,파퍼 후불구입,
 
cv elredy2016
cv elredy2016cv elredy2016
cv elredy2016
 
Poder ciudadano y electoral
Poder ciudadano y electoralPoder ciudadano y electoral
Poder ciudadano y electoral
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳,파퍼 사기없는곳,파퍼 데이트 강간약,파퍼 구조식,
 
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,
첫날밤은 언제해?『cia88-com』카톡:888J 파퍼구입,파퍼 판매,파퍼 구입,파퍼 파는곳파퍼 팝니다,파퍼 추천,파퍼 종류,
 
Chatbots DDD North2016
Chatbots DDD North2016Chatbots DDD North2016
Chatbots DDD North2016
 
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANK
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANKAn OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANK
An OVERVIEW OF CSR ACTIVITIES PERFORMED BY HDFC BANK
 
Building a bot with an intent
Building a bot with an intentBuilding a bot with an intent
Building a bot with an intent
 

Similar to Discriminators for use in flow-based classification

Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...
Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...
Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...Areej Qasrawi
 
Traffic Engineering in Software Defined Networking SDN
Traffic Engineering in Software Defined Networking SDNTraffic Engineering in Software Defined Networking SDN
Traffic Engineering in Software Defined Networking SDNijtsrd
 
Chapter 3. sensors in the network domain
Chapter 3. sensors in the network domainChapter 3. sensors in the network domain
Chapter 3. sensors in the network domainPhu Nguyen
 
Optimal Converge cast Methods for Tree- Based WSNs
Optimal Converge cast Methods for Tree- Based WSNsOptimal Converge cast Methods for Tree- Based WSNs
Optimal Converge cast Methods for Tree- Based WSNsIJMER
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONijdms
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONijdms
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONijdms
 
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2csandit
 
cns-unit-4-onlinenotepad.io.pdf
cns-unit-4-onlinenotepad.io.pdfcns-unit-4-onlinenotepad.io.pdf
cns-unit-4-onlinenotepad.io.pdfssuser37a73f
 
Analyzing performance of zrp by varying node density and transmission range
Analyzing performance of zrp by varying node density and transmission rangeAnalyzing performance of zrp by varying node density and transmission range
Analyzing performance of zrp by varying node density and transmission rangeAlexander Decker
 
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSREDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSijcsit
 
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSREDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSAIRCC Publishing Corporation
 
An efficient ant optimized multipath routing in wireless sensor network
An efficient ant optimized multipath routing in wireless sensor networkAn efficient ant optimized multipath routing in wireless sensor network
An efficient ant optimized multipath routing in wireless sensor networkEditor Jacotech
 
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...journal ijrtem
 
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...IJRTEMJOURNAL
 
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNET
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNETDIFFERENTIATED SERVICES ENSURING QOS ON INTERNET
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNETijcseit
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11eSAT Publishing House
 
Analysis of data transmission in wireless lan for 802.11 e2 et
Analysis of data transmission in wireless lan for 802.11 e2 etAnalysis of data transmission in wireless lan for 802.11 e2 et
Analysis of data transmission in wireless lan for 802.11 e2 eteSAT Journals
 
Network Telemetry
Network TelemetryNetwork Telemetry
Network TelemetryAalok Shah
 

Similar to Discriminators for use in flow-based classification (20)

Ba25315321
Ba25315321Ba25315321
Ba25315321
 
Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...
Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...
Traffic Features Extraction and Clustering Analysis for Abnormal Behavior Det...
 
Traffic Engineering in Software Defined Networking SDN
Traffic Engineering in Software Defined Networking SDNTraffic Engineering in Software Defined Networking SDN
Traffic Engineering in Software Defined Networking SDN
 
Chapter 3. sensors in the network domain
Chapter 3. sensors in the network domainChapter 3. sensors in the network domain
Chapter 3. sensors in the network domain
 
Optimal Converge cast Methods for Tree- Based WSNs
Optimal Converge cast Methods for Tree- Based WSNsOptimal Converge cast Methods for Tree- Based WSNs
Optimal Converge cast Methods for Tree- Based WSNs
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
 
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATIONSCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
SCALING DISTRIBUTED DATABASE JOINS BY DECOUPLING COMPUTATION AND COMMUNICATION
 
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2
Performance Evaluation of a Layered WSN Using AODV and MCF Protocols in NS-2
 
cns-unit-4-onlinenotepad.io.pdf
cns-unit-4-onlinenotepad.io.pdfcns-unit-4-onlinenotepad.io.pdf
cns-unit-4-onlinenotepad.io.pdf
 
Analyzing performance of zrp by varying node density and transmission range
Analyzing performance of zrp by varying node density and transmission rangeAnalyzing performance of zrp by varying node density and transmission range
Analyzing performance of zrp by varying node density and transmission range
 
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSREDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
 
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKSREDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
REDUCTION OF MONITORING REGISTERS ON SOFTWARE DEFINED NETWORKS
 
An efficient ant optimized multipath routing in wireless sensor network
An efficient ant optimized multipath routing in wireless sensor networkAn efficient ant optimized multipath routing in wireless sensor network
An efficient ant optimized multipath routing in wireless sensor network
 
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
 
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...Clustering based Time Slot Assignment Protocol for Improving Performance in U...
Clustering based Time Slot Assignment Protocol for Improving Performance in U...
 
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNET
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNETDIFFERENTIATED SERVICES ENSURING QOS ON INTERNET
DIFFERENTIATED SERVICES ENSURING QOS ON INTERNET
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11
 
Analysis of data transmission in wireless lan for 802.11 e2 et
Analysis of data transmission in wireless lan for 802.11 e2 etAnalysis of data transmission in wireless lan for 802.11 e2 et
Analysis of data transmission in wireless lan for 802.11 e2 et
 
Network Telemetry
Network TelemetryNetwork Telemetry
Network Telemetry
 

Discriminators for use in flow-based classification

  • 1. Discriminators for use in flow-based classification ISSN 1470-5559 RR-05-13 updated August 2005 Department of Computer Science Andrew Moore, Denis Zuev and Michael Crogan
  • 2.
  • 3. Discriminators for use in flow-based classification∗ Andrew W. Moore, Queen Mary, University of London, Department of Computer Science, † Denis Zuev, University of Oxford, Mathematical Institute,‡ and Michael L. Crogan§ August 2004, updated August 2005 Abstract Any assessment of classification techniques requires data. This document describes sets of data intended to aid in the assessment of classification work. A number of data sets are described; each data set consists a number of objects, and each object is described by a group of features (also referred to as discriminators). Leveraged by a quantity of hand-classified data, each object within each data set represents a single flow of TCP packets between client and server. The features for each object consist of the (application-centric) classification derived elsewhere and a number of features derived as input to probabilistic classification techniques. In addition to describing the features, we also provide information allowing interested parties to retrieve these data sets for use in their own work. The data sets contain no site-identifying information; each object is only described by a set of statistics and a class that defines the causal application. 1 Introduction This work describes sets of data that we have provided to the research community in order to allow them to assess their classification techniques on real data. The intention of this work is to provide the provenance of the discriminators that describe each object as well as to provide background on the formation of the objects and subsequent limitations of this data set. We created these data sets as training sets to allow assessment of probabilistic classification techniques. The information in the features is derived using packet header information alone, while the classification- class has been derived using a content-based analysis. The content-based classification process is described in [1, 2]. We have used data collected by the high-performance network monitor described in [3]. We use its loss- limited, capture to disk providing timestamps with resolution of better than 35 nanoseconds. The site, B, is a research facility host to about 1,000 users connected to the Internet via a full-duplex Gigabit Ethernet link. Our data is based upon a 24-hour, full-duplex trace of this research facility. Section 2 describes how we subsample the 24-hour period, creating 10 separate data sets each from a different period of the 24-hour day. Section 3 describes how we filter the data of each period to create the sets of objects (TCP flows) we subsequently characterize. Section 4 describes each of the discriminators (features) that are used to characterize the objects of each data set. Section 5 notes the address from which this data may be retrieved, summarizes the features and limitations of these data sets and notes where future work may take us. ∗Contact author: andrew.moore@dcs.qmul.ac.uk †This work was completed when Andrew Moore was supported by a research fellowship from the the Intel Corporation. ‡This work was completed when Denis Zuev was employed by Intel Research, Cambridge. §This work was completed when Michael Crogan was employed by Intel Research, Cambridge. 1
  • 4. 15 45 75 30 60 90 00:00 12:00 18:00 Mbps Time (GMT)06:00 01 02 03 04 05 06 07 08 09 10 Block Number Figure 1: Heuristic illustration of how data blocks were obtained. The line represents the instantaneous bandwidth requirements during the day, while the dark regions represent the data of each data set. Data-set Start-time End-time Duration Flows (Objects) entry01 2003-Aug-20 00:34:21 2003-Aug-20 01:04:43 1821.8 24863 entry02 2003-Aug-20 01:37:37 2003-Aug-20 02:05:54 1696.7 23801 entry03 2003-Aug-20 02:45:19 2003-Aug-20 03:14:03 1724.1 22932 entry04 2003-Aug-20 04:03:31 2003-Aug-20 04:33:15 1784.1 22285 entry05 2003-Aug-20 04:39:10 2003-Aug-20 05:09:05 1794.9 21648 entry06 2003-Aug-20 06:07:28 2003-Aug-20 06:35:06 1658.5 19384 entry07 2003-Aug-20 09:42:17 2003-Aug-20 10:11:16 1739.2 55835 entry08 2003-Aug-20 11:52:40 2003-Aug-20 12:20:26 1665.9 55494 entry09 2003-Aug-20 13:45:37 2003-Aug-20 14:13:21 1664.5 66248 entry10 2003-Aug-20 14:55:44 2003-Aug-20 15:22:37 1613.4 65036 Table 1: Broad statistics of each data set. 2 Data sets In order to construct the sets of flows, the day trace was split into ten blocks of approximately 1680 seconds (28 minutes) each. In order to provide a wider sample of mixing across the day, the start of each sample was selected randomly (uniformly distributed over the whole day trace). Figure 1 illustrates heuristically our technique. It can be seen from Table 1 that there are a different number of flows in each data block, due to a variable density of traffic during each constant period. Since time statistics of flows are present in the analysis, we consider it to be important to keep a fixed time window when selecting flows. Each data set represents a period of time taken from within the day. Table 1 provides a list of the data sets along with information about their duration and the number of flows present in each data set. While each of the data sets represent approximately the same period of time, the number of objects per data set fluctuates as a result of the variation in activity throughout the course of the day. While the start times of the ten data sets were selected from a random uniform distribution, there is a clear bias toward the first 16 hours of the day. As noted in Future work the discriminator-based characterization is planned for all flows over the whole 24-hour period thereby removing this bias from the data. 2
  • 5. 3 Flow Definition Each data set is represented as a text file, and consists of multiple lines. Each line represents an object (flow). In the Internet a flow may be defined as one or more packets traveling between two computer addresses using a particular protocol (e.g., TCP, UDP, ICMP)—and where appropriate a particular pair of ports (defined for each end of the flow). This tuple of information (hostsrc, hostdest, portsrc, portdest, protocol) is present in every packet. Source and destination elements in the tuple will be reversed for packets traveling in the opposite direction. In this way a stream of packets may be either half- or full-duplex. In order to simplify our definitions, we elected to concentrate upon the TCP protocol and TCP flows only, UDP data is planned in future work. TCP is a stateful protocol and its flows have a well-defined beginning and end, and subsequently do not lend themselves to the unclear start-end definitions that may plague a flow of data consisting of individual datagrams alone (such as UDP). In the simplest case, a complete flow is well defined when both a complete flow setup and tear-down are observed. The complexity in any flow definition occurs when the setup is incomplete or the tear-down is abnormal. A set of rules is embodied into netdude [4] that implements the TCP state engine in a sufficiently robust fashion even in the face of packet loss (as might occur in a network monitoring system). We use netdude to create a collection of complete TCP flows. Complete TCP flows have a number of desirable properties. For example, using complete flows allows us to differentiate client from server, and this allows us to reliably identify the client and server ports. Additionally, complete TCP flows allow us to compute nearly all the discriminators provided in each data set. 4 Discriminators This data set is intended to provide a wide variety of features to characterise flows. This includes simple statistics about packet length and inter-packet timings, and information derived from the transport protocol (TCP): such as SYN and ACK counts. This information is provided based on all packets (both directions) and on each direction individually (server → client and client → server). Many packet statistics are derived directly by counting packets, and packet header-sizes. A significant number of features (such as estimates of round-trip time, size of TCP segments, and the total number of retransmissions) are derived from the TCP headers — we use tcptrace[5] for this information. We describe each flow using three modes: idle : no packets between client and server for greater-than or equal-to two seconds, interactive : data packets moving in both directions, and bulk : data packets in one direction and only acknowledgments in the other. We provide features of the flow duration, the time a flow spends in the bulk mode, and the time spent by the flow in the idle mode. The effective bandwidth utilisation is a computation of entropy that provides an insight into the activity of a flow. Such an entropy measure may be computed over a number of different time-scales as a result the values provided are single points in a continuum. Finally, from the inter-arrival time of packets (in both directions and each direction individually) we also provide a ranked list of the ten frequency components that contribute the most to the Fourier transform and the effective bandwidth utilisation 1 1It may be argued that frequency components based upon binning would be more useful than a ranked list — we plan to add that in the future. 3
  • 6. Table 2: Discriminators and Definitions Number Short Long 1 Server Port Port Number at server; we can establish server and client ports as we limit ourselves to flows for which we see the initial connection set-up. 2 Client Port Port Number at client 3 min IAT Minimum packet inter-arrival time for all packets of the flow (considering both directions). 4 q1 IAT First quartile inter-arrival time 5 med IAT Median inter-arrival time 6 mean IAT Mean inter-arrival time 7 q3 IAT Third quartile packet inter-arrival time 8 max IAT Maximum packet inter-arrival time 9 var IAT Variance in packet inter-arrival time 10 min data wire Minimum of bytes in (Ethernet) packet, using the size of the packet on the wire. 11 q1 data wire First quartile of bytes in (Ethernet) packet 12 med data wire Median of bytes in (Ethernet) packet 13 mean data wire Mean of bytes in (Ethernet) packet 14 q3 data wire Third quartile of bytes in (Ethernet) packet 15 max data wire Maximum of bytes in (Ethernet) packet 16 var data wire Variance of bytes in (Ethernet) packet 17 min data ip Minimum of total bytes in IP packet, using the size of payload declared by the IP packet 18 q1 data ip First quartile of total bytes in IP packet 19 med data ip Median of total bytes in IP packet 20 mean data ip Mean of total bytes in IP packet 21 q3 data ip Third quartile of total bytes in IP packet 22 max data ip Maximum of total bytes in IP packet 23 var data ip Variance of total bytes in IP packet 24 min data control Minimum of control bytes in packet, size of the (IP/TCP) packet header 25 q1 data control First quartile of control bytes in packet 26 med data control Median of control bytes in packet 27 mean data control Mean of control bytes in packet 28 q3 data control Third quartile of control bytes in packet 29 max data control Maximum of control bytes in packet 30 var data control Variance of control bytes packet 31 total packets a b The total number of packets seen (client→server). 32 total packets b a ” (server→client) Continued on next page 4
  • 7. Number Short Long 33 ack pkts sent a b The total number of ack packets seen (TCP segments seen with the ACK bit set) (client→server). 34 ack pkts sent b a ” (server→client) 35 pure acks sent a b The total number of ack packets seen that were not piggy-backed with data (just the TCP header and no TCP data payload) and did not have any of the SYN/FIN/RST flags set (client→server) 36 pure acks sent b a ” (server→client) 37 sack pkts sent a b The total number of ack packets seen carrying TCP SACK [6] blocks (client→server) 38 sack pkts sent b a ” (server→client) 39 dsack pkts sent a b The total number of sack packets seen that carried duplicate SACK (D-SACK) [7] blocks. (client→server) 40 dsack pkts sent b a ” (server→client) 41 max sack blks/ack a b The maximum number of sack blocks seen in any sack packet. (client→server) 42 max sack blks/ack b a ” (server→client) 43 unique bytes sent a b The number of unique bytes sent, i.e., the total bytes of data sent excluding retransmitted bytes and any bytes sent doing window probing. (client→server) 44 unique bytes sent b a ” (server→client) 45 actual data pkts a b The count of all the packets with at least a byte of TCP data payload. (client→server) 46 actual data pkts b a ” (server→client) 47 actual data bytes a b The total bytes of data seen. Note that this includes bytes from retransmissions / window probe packets if any. (client→server) 48 actual data bytes b a ” (server→client) 49 rexmt data pkts a b The count of all the packets found to be retransmis- sions. (client→server) 50 rexmt data pkts b a ” (server→client) 51 rexmt data bytes a b The total bytes of data found in the retransmitted packets. (client→server) 52 rexmt data bytes b a ” (server→client) 53 zwnd probe pkts a b The count of all the window probe packets seen. (Win- dow probe packets are typically sent by a sender when the receiver last advertised a zero receive window, to see if the window has opened up now). (client→server) 54 zwnd probe pkts b a ” (server→client) 55 zwnd probe bytes a b The total bytes of data sent in the window probe pack- ets. (client→server) 56 zwnd probe bytes b a ” (server→client) 57 outoforder pkts a b The count of all the packets that were seen to arrive out of order. (client→server) Continued on next page 5
  • 8. Number Short Long 58 outoforder pkts b a ” (server→client) 59 pushed data pkts a b The count of all the packets seen with the PUSH bit set in the TCP header. (client→server) 60 pushed data pkts b a ” (server→client) 61 SYN pkts sent a b The count of all the packets seen with the SYN bits set in the TCP header respectively (client→server) 62 FIN pkts sent a b The count of all the packets seen with the FIN bits set in the TCP header respectively (client→server) 63 SYN pkts sent b a The count of all the packets seen with the SYN bits set in the TCP header respectively (server→client) 64 FIN pkts sent b a The count of all the packets seen with the FIN bits set in the TCP header respectively (server→client) 65 req 1323 ws a b If the endpoint requested Window Scaling/Time Stamp options as specified in RFC 1323[8] a ‘Y’ is printed on the respective field. If the option was not requested, an ‘N’ is printed. For example, an “N/Y” in this field means that the window-scaling option was not specified, while the Time-stamp option was speci- fied in the SYN segment. (client→server) 66 req 1323 ts a b . . . 67 req 1323 ws b a If the endpoint requested Window Scaling/Time Stamp options as specified in RFC 1323[8] a ‘Y’ is printed on the respective field. If the option was not requested, an ‘N’ is printed. For example, an “N/Y” in this field means that the window-scaling option was not specified, while the Time-stamp option was speci- fied in the SYN segment. (client→server) 68 req 1323 ts b a . . . 69 adv wind scale a b The window scaling factor used. Again, this field is valid only if the connection was captured fully to in- clude the SYN packets. Since the connection would use window scaling if and only if both sides requested window scaling [8], this field is reset to 0 (even if a window scale was requested in the SYN packet for this direction), if the SYN packet in the reverse direction did not carry the window scale option. (client→server) 70 adv wind scale b a ” (server→client) 71 req sack a b If the end-point sent a SACK permitted option in the SYN packet opening the connection, a ‘Y’ is printed; otherwise ‘N’ is printed. (client→server) 72 req sack b a ” (server→client) 73 sacks sent a b The total number of ACK packets seen carrying SACK information. (client→server) 74 sacks sent b a ” (server→client) Continued on next page 6
  • 9. Number Short Long 75 urgent data pkts a b The total number of packets with the URG bit turned on in the TCP header. (client→server) 76 urgent data pkts b a ” (server→client) 77 urgent data bytes a b The total bytes of urgent data sent. This field is cal- culated by summing the urgent pointer offset values found in packets having the URG bit set in the TCP header. (client→server) 78 urgent data bytes b a ” (server→client) 79 mss requested a b The Maximum Segment Size (MSS) requested as a TCP option in the SYN packet opening the connec- tion. (client→server) 80 mss requested b a ” (server→client) 81 max segm size a b The maximum segment size observed during the life- time of the connection. (client→server) 82 max segm size b a ” (server→client) 83 min segm size a b The minimum segment size observed during the life- time of the connection. (client→server) 84 min segm size b a ” (server→client) 85 avg segm size a b The average segment size observed during the lifetime of the connection calculated as the value reported in the actual data bytes field divided by the actual data pkts reported. (client→server) 86 avg segm size b a ” (server→client) 87 max win adv a b The maximum window advertisement seen. If the con- nection is using window scaling (both sides negoti- ated window scaling during the opening of the con- nection), this is the maximum window-scaled adver- tisement seen in the connection. For a connection us- ing window scaling, both the SYN segments opening the connection have to be captured in the dumpfile for this and the following window statistics to be accurate. (client→server) 88 max win adv b a ” (server→client) 89 min win adv a b The minimum window advertisement seen. This is the minimum window-scaled advertisement seen if both sides negotiated window scaling. (client→server) 90 min win adv b a ” (server→client) 91 zero win adv a b The number of times a zero receive window was adver- tised. (client→server) 92 zero win adv b a ” (server→client) 93 avg win adv a b The average window advertisement seen, calculated as the sum of all window advertisements divided by the total number of packets seen. If the connection endpoints negotiated window scaling, this average is calculated as the sum of all window-scaled advertise- ments divided by the number of window-scaled packets seen. Note that in the window-scaled case, the win- dow advertisements in the SYN packets are excluded since the SYN packets themselves cannot have their window advertisements scaled, as per RFC 1323 [8]. (client→server) Continued on next page 7
  • 10. Number Short Long 94 avg win adv b a ” (server→client) 95 initial window-bytes a b The total number of bytes sent in the initial window i.e., the number of bytes seen in the initial flight of data before receiving the first ack packet from the other endpoint. Note that the ack packet from the other endpoint is the first ack acknowledging some data (the ACKs part of the 3-way handshake do not count), and any retransmitted packets in this stage are excluded. (client→server) 96 initial window-bytes b a ” (server→client) 97 initial window-packets a b The total number of segments (packets) sent in the initial window as explained above. (client→server) 98 initial window-packets b a ” (server→client) 99 ttl stream length a b The Theoretical Stream Length. This is calculated as the difference between the sequence numbers of the SYN and FIN packets, giving the length of the data stream seen. Note that this calculation is aware of sequence space wrap-arounds, and is printed only if the connection was complete (both the SYN and FIN packets were seen). (client→server) 100 ttl stream length b a ” (server→client) 101 missed data a b The missed data, calculated as the difference be- tween the ttl stream length and unique bytes sent. If the connection was not complete, this calculation is invalid and an “NA” (Not Available) is printed. (client→server) 102 missed data b a ” (server→client) 103 truncated data a b The truncated data, calculated as the total bytes of data truncated during packet capture. For example, with tcpdump, the snaplen option can be set to 64 (with -s option) so that just the headers of the packet (assuming there are no options) are captured, truncat- ing most of the packet data. In an Ethernet with max- imum segment size of 1500 bytes, this would amount to truncated data of 1500 64 = 1436bytes for a packet. (client→server) 104 truncated data b a ” (server→client) 105 truncated packets a b The total number of packets truncated as explained above. (client→server) 106 truncated packets b a ” (server→client) 107 data xmit time a b Total data transmit time, calculated as the differ- ence between the times of capture of the first and last packets carrying non-zero TCP data payload. (client→server) Continued on next page 8
  • 11. Number Short Long 108 data xmit time b a ” (server→client) 109 idletime max a b Maximum idle time, calculated as the maximum time between consecutive packets seen in the direction. (client→server) 110 idletime max b a ” (server→client) 111 throughput a b The average throughput calculated as the unique bytes sent divided by the elapsed time i.e., the value reported in the unique bytes sent field divided by the elapsed time (the time difference between the capture of the first and last packets in the direction). (client→server) 112 throughput b a ” (server→client) 113 RTT samples a b The total number of Round-Trip Time (RTT) sam- ples found. tcptrace is pretty smart about choosing only valid RTT samples. An RTT sample is found only if an ack packet is received from the other end- point for a previously transmitted packet such that the acknowledgment value is 1 greater than the last sequence number of the packet. Further, it is required that the packet being acknowledged was not retrans- mitted, and that no packets that came before it in the sequence space were retransmitted after the packet was transmitted. Note : The former condition invali- dates RTT samples due to the retransmission ambigu- ity problem, and the latter condition invalidates RTT samples since it could be the case that the ack packet could be cumulatively acknowledging the retransmit- ted packet, and not necessarily ack-ing the packet in question. (client→server) 114 RTT samples b a ” (server→client) 115 RTT min a b The minimum RTT sample seen. (client→server) 116 RTT min b a ” (server→client) 117 RTT max a b The maximum RTT sample seen. (client→server) 118 RTT max b a ” (server→client) 119 RTT avg a b The average value of RTT found, calculated straightforward-ly as the sum of all the RTT values found divided by the total number of RTT samples. (client→server) 120 RTT avg b a ” (server→client) 121 RTT stdv a b The standard deviation of the RTT samples. (client→server) 122 RTT stdv b a ” (server→client) 123 RTT from 3WHS a b The RTT value calculated from the TCP 3-Way Hand-Shake (connection opening) [9], assuming that the SYN packets of the connection were captured. (client→server) Continued on next page 9
  • 12. Number Short Long 124 RTT from 3WHS b a ” (server→client) 125 RTT full sz smpls a b The total number of full-size RTT samples, calculated from the RTT samples of full-size segments. Full-size segments are defined to be the segments of the largest size seen in the connection. (client→server) 126 RTT full sz smpls b a ” (server→client) 127 RTT full sz min a b The minimum full-size RTT sample. (client→server) 128 RTT full sz min b a ” (server→client) 129 RTT full sz max a b The maximum full-size RTT sample. (client→server) 130 RTT full sz max b a ” (server→client) 131 RTT full sz avg a b The average full-size RTT sample. (client→server) 132 RTT full sz avg b a ” (server→client) 133 RTT full sz stdev a b The standard deviation of full-size RTT samples. (client→server) 134 RTT full sz stdev b a ” (server→client) 135 post-loss acks a b The total number of ack packets received after losses were detected and a retransmission occurred. More precisely, a post-loss ack is found to occur when an ack packet acknowledges a packet sent (acknowledg- ment value in the ack pkt is 1 greater than the packet’s last sequence number), and at least one packet occur- ring before the packet acknowledged, was retransmit- ted later. In other words, the ack packet is received after we observed a (perceived) loss event and are re- covering from it. (client→server) 136 post-loss acks b a ” (server→client) 137 segs cum acked a b The count of the number of segments that were cumu- latively acknowledged and not directly acknowledged. (client→server) 138 segs cum acked b a ” (server→client) 139 duplicate acks a b The total number of duplicate acknowledgments re- ceived. (client→server) 140 duplicate acks b a ” (server→client) 141 triple dupacks a b The total number of triple duplicate acknowledgments received (three duplicate acknowledgments acknowl- edging the same segment), a condition commonly used to trigger the fast-retransmit/fast-recovery phase of TCP. (client→server) 142 triple dupacks b a ” (server→client) 143 max # retrans a b The maximum number of retransmissions seen for any segment during the lifetime of the connection. (client→server) Continued on next page 10
  • 13. Number Short Long 144 max # retrans b a ” (server→client) 145 min retr time a b The minimum time seen between any two (re)transmissions of a segment amongst all the retransmissions seen. (client→server) 146 min retr time b a ” (server→client) 147 max retr time a b The maximum time seen between any two (re)transmissions of a segment. (client→server) 148 max retr time b a ” (server→client) 149 avg retr time a b The average time seen between any two (re)transmissions of a segment calculated from all the retransmissions. (client→server) 150 avg retr time b a ” (server→client) 151 sdv retr time a b The standard deviation of the retransmission-time samples obtained from all the retransmissions. (client→server) 152 sdv retr time b a ” (server→client) 153 min data wire a b Minimum number of bytes in (Ethernet) packet (client→server) 154 q1 data wire a b First quartile of bytes in (Ethernet) packet 155 med data wire a b Median of bytes in (Ethernet) packet 156 mean data wire a b Mean of bytes in (Ethernet) packet 157 q3 data wire a b Third quartile of bytes in (Ethernet) packet 158 max data wire a b Maximum of bytes in (Ethernet) packet 159 var data wire a b Variance of bytes in (Ethernet) packet 160 min data ip a b Minimum number of total bytes in IP packet 161 q1 data ip a b First quartile of total bytes in IP packet 162 med data ip a b Median of total bytes in IP packet 163 mean data ip a b Mean of total bytes in IP packet 164 q3 data ip a b Third quartile of total bytes in IP packet 165 max data ip a b Maximum of total bytes in IP packet 166 var data ip a b Variance of total bytes in IP packet 167 min data control a b Minimum of control bytes in packet 168 q1 data control a b First quartile of control bytes in packet 169 med data control a b Median of control bytes in packet 170 mean data control a b Mean of control bytes in packet 171 q3 data control a b Third quartile of control bytes in packet 172 max data control a b Maximum of control bytes in packet 173 var data control a b Variance of control bytes packet 174 min data wire b a Minimum number of bytes in (Ethernet) packet (server→client) 175 q1 data wire b a First quartile of bytes in (Ethernet) packet 176 med data wire b a Median of bytes in (Ethernet) packet 177 mean data wire b a Mean of bytes in (Ethernet) packet Continued on next page 11
  • 14. Number Short Long 178 q3 data wire b a Third quartile of bytes in (Ethernet) packet 179 max data wire b a Maximum of bytes in (Ethernet) packet 180 var data wire b a Variance of bytes in (Ethernet) packet 181 min data ip b a Minimum number of total bytes in IP packet 182 q1 data ip b a First quartile of total bytes in IP packet 183 med data ip b a Median of total bytes in IP packet 184 mean data ip b a Mean of total bytes in IP packet 185 q3 data ip b a Third quartile of total bytes in IP packet 186 max data ip b a Maximum of total bytes in IP packet 187 var data ip b a Variance of total bytes in IP packet 188 min data control b a Minimum of control bytes in packet 189 q1 data control b a First quartile of control bytes in packet 190 med data control b a Median of control bytes in packet 191 mean data control b a Mean of control bytes in packet 192 q3 data control b a Third quartile of control bytes in packet 193 max data control b a Maximum of control bytes in packet 194 var data control b a Variance of control bytes packet 195 min IAT a b Minimum of packet inter-arrival time (client→server) 196 q1 IAT a b First quartile of packet inter-arrival time 197 med IAT a b Median of packet inter-arrival time 198 mean IAT a b Mean of packet inter-arrival time 199 q3 IAT a b Third quartile of packet inter-arrival time 200 max IAT a b Maximum of packet inter-arrival time 201 var IAT a b Variance of packet inter-arrival time 202 min IAT b a Minimum of packet inter-arrival time (server→client) 203 q1 IAT b a First quartile of packet inter-arrival time 204 med IAT b a Median of packet inter-arrival time 205 mean IAT b a Mean of packet inter-arrival time 206 q3 IAT b a Third quartile of packet inter-arrival time 207 max IAT b a Maximum of packet inter-arrival time 208 var IAT b a Variance of packet inter-arrival time 209 Time since last connection Time since the last connection between these hosts 210 No. transitions bulk/trans The number of transitions between transaction mode and bulk transfer mode, where bulk transfer mode is defined as the time when there are more than three successive packets in the same direction without any packets carrying data in the other direction 211 Time spent in bulk Amount of time spent in bulk transfer mode 212 Duration Connection duration 213 % bulk Percent of time spent in bulk transfer 214 Time spent idle The time spent idle (where idle time is the accumu- lation of all periods of 2 seconds or greater when no packet was seen in either direction) Continued on next page 12
  • 15. Number Short Long 215 % idle Percent of time spent idle 216 Effective Bandwidth Effective Bandwidth based upon entropy [10] (both directions) 217 Effective Bandwidth a b ” (client→server) 218 Effective Bandwidth b a ” (server→client) 219 FFT all FFT of packet IAT (arctan of the top-ten frequencies ranked by the magnitude of their contribution) (all traffic) (Frequency #1) 220 FFT all ” (Frequency #2) 221 FFT all ” . . . 222 FFT all ” . . . 223 FFT all ” . . . 224 FFT all ” . . . 225 FFT all ” . . . 226 FFT all ” . . . 227 FFT all ” . . . 228 FFT all ” (Frequency #10) 229 FFT a b FFT of packet IAT (arctan of the top-ten frequen- cies ranked by the magnitude of their contribution) (client→server) (Frequency #1) 230 FFT a b ” (Frequency #2) 231 FFT a b ” . . . 232 FFT a b ” . . . 233 FFT a b ” . . . 234 FFT a b ” . . . 235 FFT a b ” . . . 236 FFT a b ” . . . 237 FFT a b ” . . . 238 FFT b a ” (Frequency #10) 239 FFT b a FFT of packet IAT (arctan of the top-ten frequen- cies ranked by the magnitude of their contribution) (server→client) (Frequency #1) 240 FFT b a ” (Frequency #2) 241 FFT b a ” . . . 242 FFT b a ” . . . 243 FFT b a ” . . . 244 FFT b a ” . . . 245 FFT b a ” . . . 246 FFT b a ” . . . 247 FFT b a ” . . . 248 FFT b a ” (Frequency #10) 249 Classes Application class, as assigned in [1] 5 Conclusion The pre-computed discriminator-data sets are available off-of http://www.dcs.qmul.ac.uk/research/nrl/ and http://www.cl.cam.ac.uk/Research/SRG/netos/ nprobe/data/papers/sigmetrics/index.html. While we make every effort to ensure they are without flaw — and that these archives are maintained — they are provided on an as-is basis. Additionally, scripts/code to allow the community to generate discriminators for their own data are available from the above web-sites. 13
  • 16. Future Work It is the intention of the authors to provide sets of discriminators and input data for other traffic as it becomes available. Thanks We thank Matt Roughan for some illuminating conversations pertaining to this work and we are indebted to a number of tool makers notably, everyone who made tcpdump/libpcap the backbone to our research it is today, Christian Kreibich for his netdude tools and Shawn Ostermann for the tcptrace toolkit. References [1] A. Moore and Konstantina Papagiannaki. Toward the accurate identification of network applications, 2005. http://www.cl.cam.ac.uk/ awm22/publication/moore2005toward.pdf. [2] A. W. Moore. Discrete content-based classification — a data set. Technical report, Intel Research, Cambridge, 2005. [3] Andrew Moore, James Hall, Christian Kreibich, Euan Harris, and Ian Pratt. Architecture of a Network Monitor. In Passive & Active Measurement Workshop 2003 (PAM2003), April 2003. [4] Sourceforge. netdude, 2005. http://netdude.sourceforge.net. [5] Shawn Ostermann. tcptrace, 2003. http://www.tcptrace.org. [6] M. Mathis and J. Mahdavi and S. Floyd and A. Romanow. TCP Selective Acknowledgement Options, October 1996. M.MATHIS, J.MAHDAVI, S.FLOYD, AND A.ROMANOW. TCP Selective Acknowl- edgement Options, October 1996. [7] S. Floyd and J. Mahdavi and M. Mathis and M. Podolsky. An Extension to the Selective Acknowledge- ment (SACK) Option for TCP, July 2000. S.FLOYD, J.MAHDAVI, M.MATHIS, AND M.PODOLSKY. An Extension to the Selective Acknowledgement (SACK) Option for TCP, July 2000. [8] T. Berners-Lee and R. Fielding and H. Frystyk. Hypertext Transfer Protocol – HTTP/1.0, May 1996. T.BERNERS-LEE, R.FIELDING, AND H.FRYSTYK. Hypertext Transfer Protocol - HTTP/1.0, May 1996. [9] J. Postel. Transmission Control Protocol, September 1981. INFORMATION SCIENCES INSTITUTE, U. O. S. C. Transmission Control Protocol, September 1981. [10] N. G. Duffield, J. T. Lewis, N. O’Connell, R. Russell, and F. Toomey. Entropy of ATM traffic streams. IEEE Journal on Selected Areas in Communications, 13(6):981–990, August 1995. 14