2. Overview :
● What is the “KDD Cup-1999” data set ?
● Data redundancy
● Types of attack
● Data partitioning
● Imbalanced data set(s)
● Results
● Conclusion
● References
3. What is the “KDD Cup-1999” data set ?
KDD Cup 1999 : “Computer Network Intrusion Detection” problem.
[ intrusion = unauthorized user(s) ]
Records : 4,898,431 ( around 5 million ) in the “train data set” &
311,027 in the “test data set”.
Features : 41 ( & a class label, which takes 23 values. )
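
A minimal sketch of how one record could be split into its 41 features and class label. It assumes comma-separated records with the label as the last field, possibly ending in a period. The helper name `split_record` and the sample record are made up for illustration.

```python
N_FEATURES = 41  # feature count stated on the slide

def split_record(line):
    """Return (features, label) for one comma-separated record."""
    fields = line.strip().split(",")
    if len(fields) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} fields, got {len(fields)}")
    # Assumption: the class label is the last field and may end with a period.
    label = fields[-1].rstrip(".")
    return fields[:-1], label

# Made-up record for illustration only (not copied from the data set).
sample = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37 + ["normal."])
features, label = split_record(sample)
print(len(features), label)
```

Rejecting records with the wrong field count early keeps malformed lines from silently shifting the label column.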
7. Imbalanced data set(s)
[ K. Leung et al. ]
Sub-data sets : 4, 5, 6 & 7 consist entirely of “smurf” records & 8 consists entirely of “neptune” records.
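
One way to check the composition of each sub-data set is to cut the label column into consecutive chunks and count the labels in each chunk. This is a sketch only: the helper name `partition_label_counts`, the number of chunks and the toy labels are assumptions, not taken from the cited work.

```python
from collections import Counter

def partition_label_counts(labels, n_parts):
    """Split labels into n_parts consecutive chunks and count labels per chunk."""
    size, extra = divmod(len(labels), n_parts)
    counts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional record each.
        end = start + size + (1 if i < extra else 0)
        counts.append(Counter(labels[start:end]))
        start = end
    return counts

# Made-up label sequence, ordered the way a file with long attack bursts might be.
toy_labels = ["normal"] * 4 + ["smurf"] * 8 + ["neptune"] * 4
for number, count in enumerate(partition_label_counts(toy_labels, 4), start=1):
    print(number, dict(count))
```

A chunk whose counter holds a single key is a sub-data set made up of only one class.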
8. [ K. Leung et al. ] observed :
1. Around 78% of the “train data” records are duplicates &
2. Around 75% of the “test data” records are duplicates.
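
A duplicate rate like the ones above can be computed as 1 - (distinct records / total records). The sketch below assumes each record is a hashable tuple; `duplicate_rate` is a hypothetical helper and the toy records are made up.

```python
def duplicate_rate(records):
    """Fraction of records that repeat an earlier, identical record."""
    if not records:
        return 0.0
    distinct = len(set(records))
    return 1 - distinct / len(records)

# Made-up records: one distinct "normal" row and one "smurf" row seen three times.
toy_records = [
    ("tcp", "http", "normal"),
    ("icmp", "ecr_i", "smurf"),
    ("icmp", "ecr_i", "smurf"),
    ("icmp", "ecr_i", "smurf"),
]
print(duplicate_rate(toy_records))
```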
[ Portnoy et al. ] observed :
The distribution of this data set is very uneven, which makes cross-validation
difficult.
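
A common way to soften this problem is to build the folds per class (stratification) instead of cutting the file into consecutive pieces. This technique is not taken from the cited authors. The sketch below is a plain-Python illustration with a hypothetical helper `stratified_folds` and made-up labels.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign record indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Made-up, heavily ordered labels: consecutive folds would each see one class.
toy_cv_labels = ["smurf"] * 6 + ["neptune"] * 3 + ["normal"] * 3
for fold in stratified_folds(toy_cv_labels, 3):
    print(dict(Counter(toy_cv_labels[i] for i in fold)))
```

A class with fewer than k records will still be missing from some folds, so very rare attack types remain hard to validate.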
10. Code :
Drawing a comparison bar plot ( in R ) : https://goo.gl/KqZsMM
Sample Code ( in Python ) : https://goo.gl/O4FjRT
Sample Code ( in Java ) : https://goo.gl/0ZSOJY
11. Conclusion :
[ Tavallaee et al. ] claim that the data set has some problems.
( Such as : data redundancy, inflated accuracy rates, heavy class imbalance, etc. )
So, they proposed a new data set named “NSL-KDD”.
However, “NSL-KDD” still has some of the problems discussed by McHugh &
may not be a perfect representative of existing real networks;
public data sets for network-based IDSs remain lacking.
12. References :
1. [ Tavallaee et al. ] “A Detailed Analysis of the KDD CUP 99 Data Set”
2. [ J. McHugh ] “Testing intrusion detection systems: a critique of the 1998
and 1999 DARPA intrusion detection system evaluations as performed by
Lincoln Laboratory”
3. [ K. Leung et al. ] “Unsupervised anomaly detection in network intrusion
detection using clusters”
4. Dr. Dewan Md. Farid lecture. ( CSE 6011 & CSI 415 )
5. UC Irvine Machine Learning Repository
6. WEKA Team ( Evaluate Performance )
7. Python packages : “pandas”, “scikit-learn”
8. R package : “ggplot2”