2. NOVEL FEATURE ENGINEERING FOR MALWARE TRAFFIC DATA MINING:
FLOW PACKET LENGTH SEQUENCE:
     step 1    step 2    step 3    step 4    step 5
  0    ->   40   ->  540   ->   48   ->   40   ->  100
Toggle 1 (forward), Toggle 2 (backward), Toggle 3 (forward)
Existing Packet Length (PL) Features per flow:
• Minimum PL: 40
• Maximum PL: 540
• Mean PL: 154
• Standard Deviation: 194
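The four existing statistics can be reproduced from the example sequence with a minimal sketch like the one below. It assumes the leading 0 in the sequence is a start marker rather than a packet (which is what yields the listed minimum of 40) and that the standard deviation is the population form; the function name is illustrative only.

```python
import statistics

# Packet lengths of the example flow, without the leading 0 (assumed start marker)
packet_lengths = [40, 540, 48, 40, 100]

def packet_length_stats(lengths):
    """Return min, max, mean and population standard deviation of packet lengths."""
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.fmean(lengths),
        "std": statistics.pstdev(lengths),  # population std (divide by N), an assumption
    }

stats = packet_length_stats(packet_lengths)
print({name: round(value) for name, value in stats.items()})
# {'min': 40, 'max': 540, 'mean': 154, 'std': 194}
```

Using the sample standard deviation (statistics.stdev) instead would give about 217, so the population form appears to be the one that matches the listed value.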
Novel Packet Length (PL) Features per flow:
• Step count in forward direction:
Min: 1, Max: 2, Mean: 1, Std. deviation: 0
• Step count in backward direction:
Min: 2, Max: 2, Mean: 2, Std. deviation: 0
• Forward toggle count: 2
• Backward toggle count: 1
• Number of unique packet lengths: 4
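One possible way to derive the step and toggle features is sketched below. It assumes that a "toggle" starts each run of consecutive packets travelling in the same direction and that the "step count" of a run is the number of packets in it. The per-packet direction labels are hypothetical, chosen only so that the runs match the three toggles in the example; the function name is also illustrative.

```python
from itertools import groupby
from statistics import fmean, pstdev

def direction_run_features(directions):
    """Summarise runs of consecutive same-direction packets, per direction."""
    runs = {}
    for direction, group in groupby(directions):
        # Length of this run = number of consecutive packets in one direction
        runs.setdefault(direction, []).append(sum(1 for _ in group))
    return {
        direction: {
            "toggle_count": len(lengths),   # number of runs in this direction
            "step_min": min(lengths),
            "step_max": max(lengths),
            "step_mean": fmean(lengths),
            "step_std": pstdev(lengths),    # population std, an assumption
        }
        for direction, lengths in runs.items()
    }

# Example flow; the direction of each packet is a hypothetical assignment
packet_lengths = [40, 540, 48, 40, 100]
directions = ["fwd", "bwd", "bwd", "fwd", "fwd"]

features = direction_run_features(directions)
unique_lengths = len(set(packet_lengths))

print(features["fwd"]["toggle_count"], features["bwd"]["toggle_count"], unique_lengths)
# 2 1 4
```

With these labels the forward runs have lengths 1 and 2, so the sketch yields a fractional forward mean (1.5) and standard deviation (0.5); the whole numbers listed above may reflect integer truncation.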
3. Patterns found in K-MEANS:
Average distance within the cluster:
                       Clustering with existing   Clustering with newly added
                       flow attributes            flow attributes
cluster_1              1.110                      0.312
cluster_2              11.443                     2.708
cluster_3              13.301                     11.487
cluster_4              20.682                     22.733
cluster_5              NA                         2.327
cluster_6              NA                         13.092
cluster_7              NA                         7.638
cluster_8              NA                         9.858
Average                3.722                      3.143
Davies-Bouldin Index   0.859                      0.839
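For reference, both metrics in this table can be computed from a labelled clustering as in the sketch below. It assumes the within-cluster distance is the mean Euclidean distance from each point to its cluster centroid, which may differ from the definition used by the tool that produced the table. The four points are made-up toy data, not the poster's dataset.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def cluster_summary(points, labels):
    """Return {label: (centroid, mean distance of members to that centroid)}."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    summary = {}
    for label, members in groups.items():
        c = centroid(members)
        scatter = sum(math.dist(p, c) for p in members) / len(members)
        summary[label] = (c, scatter)
    return summary

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    summary = cluster_summary(points, labels)
    worst_ratios = []
    for i, (c_i, s_i) in summary.items():
        worst_ratios.append(max(
            (s_i + s_j) / math.dist(c_i, c_j)
            for j, (c_j, s_j) in summary.items() if j != i
        ))
    return sum(worst_ratios) / len(worst_ratios)

# Toy data: two tight, well-separated clusters
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

within = {label: scatter for label, (_, scatter) in cluster_summary(points, labels).items()}
db_index = davies_bouldin(points, labels)
print(within, round(db_index, 3))
# {0: 1.0, 1: 1.0} 0.2
```

A lower Davies-Bouldin index indicates more compact, better-separated clusters, which is the sense in which 0.839 compares favourably with 0.859.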
Clustering with existing flow attributes:
Index   Cluster ID   Absolute count   Fraction
1       cluster_0    38028            0.792
2       cluster_1    5122             0.10
3       cluster_2    3111             0.065
4       cluster_3    1769             0.037
Clustering with newly added flow attributes:
Index   Cluster ID   Absolute count   Fraction
1       cluster_0    29093            0.606
2       cluster_1    7940             0.165
3       cluster_6    3834             0.0798
4       cluster_5    2316             0.048
5       cluster_7    1686             0.035
6       cluster_3    1344             0.028
7       cluster_2    1023             0.021
8       cluster_4    794              0.0167
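The "Average" row of the within-cluster distance table appears to be a size-weighted mean of the per-cluster distances. The sketch below checks this arithmetic using the cluster counts from the two tables above, under the assumption that cluster_1 to cluster_8 in the distance table correspond, in order, to cluster_0 to cluster_7 in the count tables. That mapping is inferred from the numbers, not stated in the source.

```python
def size_weighted_mean(distances, counts):
    """Average of per-cluster distances, weighted by cluster size."""
    return sum(d * n for d, n in zip(distances, counts)) / sum(counts)

# Existing flow attributes: distances for cluster_1..4, counts for cluster_0..3
existing_distances = [1.110, 11.443, 13.301, 20.682]
existing_counts = [38028, 5122, 3111, 1769]

# Newly added flow attributes: distances for cluster_1..8, counts for cluster_0..7
new_distances = [0.312, 2.708, 11.487, 22.733, 2.327, 13.092, 7.638, 9.858]
new_counts = [29093, 7940, 1023, 1344, 794, 2316, 3834, 1686]

existing_avg = size_weighted_mean(existing_distances, existing_counts)
new_avg = size_weighted_mean(new_distances, new_counts)
print(round(existing_avg, 3), round(new_avg, 3))
# 3.722 3.143
```

Both results match the reported averages, and both count tables sum to the same 48030 flows.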
4. Patterns found in DBSCAN:
Benign Dataset:
Estimated number of clusters: 3
Estimated number of noise points: 1666
Silhouette Coefficient: 0.734
Malware Dataset:
Estimated number of clusters: 8
Estimated number of noise points: 3724
Silhouette Coefficient: 0.576
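Cluster and noise counts like these are typically read off the label vector that DBSCAN returns. The sketch below assumes the scikit-learn convention in which noise points carry the label -1; whether the poster's results were produced with scikit-learn is an assumption, and the label list is made-up toy data.

```python
def summarize_dbscan_labels(labels, noise_label=-1):
    """Return (number of clusters, number of noise points) for a DBSCAN label list."""
    clusters = set(labels) - {noise_label}   # noise is not counted as a cluster
    n_noise = sum(1 for label in labels if label == noise_label)
    return len(clusters), n_noise

# Toy labels: three clusters (0, 1, 2) and two noise points (-1)
labels = [0, 0, 1, -1, 2, 2, -1, 1, 0]
n_clusters, n_noise = summarize_dbscan_labels(labels)
print(f"Estimated number of clusters: {n_clusters}")
print(f"Estimated number of noise points: {n_noise}")
# Estimated number of clusters: 3
# Estimated number of noise points: 2
```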
5. Open questions/issues
Are there better techniques for identifying the optimal number of clusters
for K-means and the optimal epsilon value for DBSCAN?
How can we identify the best standardization technique for our dataset?
Is supervised learning a better approach in this context?
7. Difficulties involved:
Hardware limitations of switches and routers.
Privacy concerns.
Traffic encryption.
Smart malware creators.
8. FEATURES (PER FLOW):
Packet-length-based statistics per network flow, in both directions.
Network flow: a sequence of packets from a particular source to a
particular destination.
8 existing packet length features extracted by Netmate: minimum,
maximum, mean and standard deviation of packet lengths in the forward and
backward directions.
Source IP      Src port   Dest IP      Dest port   Protocol   Packet length statistics
172.16.5.203   49158      172.16.5.5   88          6          40,105,288,122,40,119,346,151
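A flow record in this layout can be split into its flow key and its eight statistics as sketched below. The field order inside the statistics column is an assumption: the values suggest minimum, mean, maximum, standard deviation (the second value, 105, is smaller than the third, 288), with the forward direction listed first. The class, constant and function names are illustrative and are not part of Netmate.

```python
from typing import NamedTuple

class FlowRecord(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int       # IP protocol number, e.g. 6 = TCP
    pl_stats: dict      # packet length statistics keyed by name

# Assumed order of the eight statistics (inferred from the values, not documented here)
STAT_NAMES = [
    "fwd_min", "fwd_mean", "fwd_max", "fwd_std",
    "bwd_min", "bwd_mean", "bwd_max", "bwd_std",
]

def parse_flow_record(line):
    """Parse one whitespace-separated flow record with a comma-separated stats field."""
    src_ip, src_port, dst_ip, dst_port, protocol, stats = line.split()
    values = [int(v) for v in stats.split(",")]
    if len(values) != len(STAT_NAMES):
        raise ValueError(f"expected {len(STAT_NAMES)} statistics, got {len(values)}")
    return FlowRecord(
        src_ip, int(src_port), dst_ip, int(dst_port), int(protocol),
        dict(zip(STAT_NAMES, values)),
    )

record = parse_flow_record(
    "172.16.5.203 49158 172.16.5.5 88 6 40,105,288,122,40,119,346,151"
)
print(record.dst_port, record.protocol, record.pl_stats["fwd_max"])
# 88 6 288
```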