Published on

  • Be the first to comment

  • Be the first to like this


  1. 1. Survey Presentation on Four Selected Research Papers on Data Mining Based Intrusion Detection System 60-564: Security and Privacy on the Internet Instructor: Dr. A. K. Aggarwal Presented By: Ahmedur Rahman Zillur Rahman Lawangeen Khan Date: April 05, 2006
  2. 2. Selected Research Papers <ul><li>Exploiting Efficient Data Mining Techniques to Enhance Intrusion Detection Systems </li></ul><ul><li>Packet Vs Session Based Modeling For Intrusion Detection System </li></ul><ul><li>Detecting Denial-of-Service Attacks with Incomplete Audit Data </li></ul><ul><li>ADAM: Detecting Intrusions by Data Mining </li></ul>
  3. 3. Table of Contents <ul><li>Introduction </li></ul><ul><li>Paper 1 </li></ul><ul><li>Paper 2 </li></ul><ul><li>Paper 3 </li></ul><ul><li>Paper 4 </li></ul><ul><li>Testing Methodology </li></ul><ul><li>Conclusion </li></ul><ul><li>Bibliography </li></ul>
  4. 4. Introduction <ul><li>Intrusion detection is a process of gathering intrusion related knowledge occurring in the process of monitoring the events and analyzing them for sign or intrusion. </li></ul><ul><li>Detecting the intrusion based on two common practices – Misuse detection and Anomaly detection. </li></ul><ul><li>To apply data mining techniques in intrusion detection, first, the collected data needs to be preprocessed and converted to the format suitable for mining processing. Next, the reformatted data will be used to develop a clustering or classification model. </li></ul>
  5. 5. Introduction <ul><li>Paper -1 discusses about various types of data mining based intrusion detection system models. </li></ul><ul><li>Paper -2 discusses on packet-based vs. session based modeling for intrusion detection. </li></ul><ul><li>Paper -3 discusses on a model named SCAN, which can work with good accuracy even with incomplete dataset. </li></ul><ul><li>Paper -4 discusses about ADAM, which was one of the leading data mining based intrusion detection system. </li></ul>Cont.
  6. 6. Paper - 1 <ul><li>The main motivation behind using intrusion detection in data mining is automation. Pattern of the normal behavior and pattern of the intrusion can be computed using data mining. </li></ul><ul><li>Two Steps for applying data mining technique to intrusion detection: </li></ul><ul><ul><li>The collected monitoring data needs to be preprocessed and converted to the format suitable for mining processing. </li></ul></ul><ul><ul><li>The reformatted data will be used to develop a clustering or classification model. </li></ul></ul><ul><li>Data mining is helpful in detecting new vulnerabilities and intrusions, discover previous unknown patterns of attacker behaviors, and provide decision support for intrusion management. </li></ul>
  7. 7. Paper - 1 <ul><li>Data Mining and Intrusion Detection </li></ul><ul><ul><li>Different data mining approaches are frequently used to analyze network data to gain intrusion related knowledge. </li></ul></ul><ul><ul><ul><li>Clustering. </li></ul></ul></ul><ul><ul><ul><li>Classification. </li></ul></ul></ul><ul><ul><ul><li>Outlier Detection. </li></ul></ul></ul><ul><ul><ul><li>Association Rule. </li></ul></ul></ul>Cont.
  8. 8. Paper - 1 <ul><li>Clustering </li></ul><ul><ul><li>Most of the clustering techniques use some basic steps involved in identifying intrusion. These steps are as follows: </li></ul></ul><ul><ul><ul><li>Find the largest cluster, i.e., the one with the most number of instances, and label it normal. </li></ul></ul></ul><ul><ul><ul><li>Sort the remaining clusters in an ascending order of their distances to the largest cluster. </li></ul></ul></ul><ul><ul><ul><li>Select the first K1 clusters so that the number of data instances in these clusters sum up to ¼ ´N, and label them as normal, where ´ is the percentage of normal instances. </li></ul></ul></ul><ul><ul><ul><li>Label all the other clusters as attacks. </li></ul></ul></ul>Cont.
  9. 9. Paper - 1 <ul><li>Classification </li></ul><ul><ul><li>Classification is similar to clustering in that it also partitions customer records into distinct segments called classes. </li></ul></ul><ul><ul><li>Unlike clustering, classification analysis requires that the end-user/analyst know ahead of time how classes are defined. </li></ul></ul><ul><ul><li>Classifications algorithms can be classified into three types: </li></ul></ul><ul><ul><ul><li>Extensions to linear discrimination. </li></ul></ul></ul><ul><ul><ul><li>Decision tree </li></ul></ul></ul><ul><ul><ul><li>Rule-based methods </li></ul></ul></ul>Cont.
  10. 10. Paper - 1 <ul><li>Association </li></ul><ul><ul><li>The Association rule is specifically designed for use in data analyses. </li></ul></ul><ul><ul><li>The objective behind using association rule based data mining is to derive multi-feature (attribute) correlations from a database table. </li></ul></ul><ul><ul><li>Association rule algorithms classified into two categories: </li></ul></ul><ul><ul><ul><li>Candidate-generation-and-test approach such as Apriori. </li></ul></ul></ul><ul><ul><ul><li>Pattern-growth approach. </li></ul></ul></ul><ul><ul><li>Basic steps for incorporating association rule for intrusion detection as follows. </li></ul></ul><ul><ul><ul><li>First network data need to be formatted into a database table where each row is an audit record and each column is a field of the audit records. </li></ul></ul></ul><ul><ul><ul><li>There is evidence that intrusions and user activities shows frequent correlations among network data. </li></ul></ul></ul><ul><ul><ul><li>Also rules based on network data can continuously merge the rules from a new run to the aggregate rule. </li></ul></ul></ul>Cont.
  11. 11. Paper - 1 <ul><li>Data Mining Based IDS: </li></ul><ul><ul><li>Data mining is becoming one of the popular techniques for detecting intrusion. IDS can be classified on the basis of their strategy of detection. There are two categories under this classification. </li></ul></ul><ul><ul><ul><li>Misuse Detection Based IDS. </li></ul></ul></ul><ul><ul><ul><li>Anomaly Detection Based IDS. </li></ul></ul></ul>Cont.
  12. 12. Paper - 1 <ul><li>IDS Using both Misuse and Anomaly Detection: </li></ul><ul><ul><li>IDS’s that use both misuse and anomaly intrusion detection techniques. Thus they are capable for detecting both known and unknown intrusions. </li></ul></ul><ul><ul><ul><li>IIDS (Intelligent Intrusion Detection System Architecture). </li></ul></ul></ul><ul><ul><ul><li>RIDS-100(Rising Intrusion Detection System). </li></ul></ul></ul>Cont.
  13. 13. Paper - 2 <ul><li>In this survey they report the findings of their research in the area of anomaly-based intrusion detection systems using data-mining techniques to create a decision tree model of their network using the 1999 DARPA Intrusion Detection Evaluation data set . </li></ul>
  14. 14. Paper - 2 <ul><li>Types Of IDS : </li></ul><ul><ul><li>Misuse Detectors </li></ul></ul><ul><ul><li>Anomaly Based Detectors </li></ul></ul><ul><li>Problem with IDS : </li></ul><ul><ul><li>False Negative </li></ul></ul><ul><ul><li>False Positive </li></ul></ul>Cont.
  15. 15. Paper - 2 <ul><li>Dynamic Modeling : </li></ul><ul><ul><li>Data Preparation </li></ul></ul><ul><ul><ul><li>Studied the data sets’ patterns and modeled the traffic patterns around a generated target variable, “TGT”. </li></ul></ul></ul><ul><ul><ul><li>They used this variable as a predictor target variable for setting the stage. </li></ul></ul></ul><ul><ul><li>Two separate data sets: </li></ul></ul><ul><ul><ul><li>Packet-based Modeling. </li></ul></ul></ul><ul><ul><ul><li>Session Based Modeling </li></ul></ul></ul>Cont.
  16. 16. Paper - 2 <ul><li>Classification Tree Modeling: </li></ul><ul><ul><li>Advantage: </li></ul></ul><ul><ul><ul><li>This method over traditional pattern-recognition methods is that the classification tree is an intuitive tool and relatively easy to interpret. </li></ul></ul></ul>Cont.
  17. 17. Paper - 2 <ul><li>Data Modeling And Assessment: </li></ul><ul><ul><li>Accomplish their data-mining goals through the use of supervised learning techniques on the binary target variable, “TGT”. </li></ul></ul><ul><ul><li>From the revised data sets they further deleted a few extraneous variables date/time, source IP address, and destination IP address variable . </li></ul></ul><ul><ul><li>Partitioned both data sets using stratified random sampling . </li></ul></ul><ul><ul><li>The TGT variable was then specified to form subsets of the original data to improve the classification precision of their model . </li></ul></ul><ul><ul><li>Created a decision tree using the Chi-Square. </li></ul></ul><ul><ul><li>After five leaves of depth, the tree maximizes its profit. </li></ul></ul>Cont.
  18. 18. Paper - 3 <ul><li>Detecting Denial-of-Service Attacks with Incomplete Audit Data </li></ul><ul><li>SCAN (Stochastic Clustering Algorithm for Network anomaly detection) </li></ul><ul><li>Improved version of Expectation-Maximization (EM) Algorithm </li></ul><ul><li>Can handle missing data in audit dataset </li></ul>
  19. 19. Paper - 3 <ul><li>Features of SCAN: </li></ul><ul><ul><li>Improvement in speed of convergence </li></ul></ul><ul><ul><ul><li>Combination of Data Summaries, Bloom Filters, Arrays </li></ul></ul></ul><ul><ul><li>Ability to detect anomaly in absence of complete audit data </li></ul></ul><ul><ul><ul><li>EM computes the maximum likelihood estimates in parametric model based on prior information </li></ul></ul></ul><ul><li>Components of SCAN: </li></ul><ul><ul><li>Online Sampling and Flow Aggregation </li></ul></ul><ul><ul><li>Clustering and Data Reduction </li></ul></ul><ul><ul><ul><li>EM based Clustering Algorithm </li></ul></ul></ul><ul><ul><ul><li>Data Summaries for Data Reduction </li></ul></ul></ul><ul><ul><ul><li>Handling Missing Data in a dataset </li></ul></ul></ul><ul><ul><li>Anomaly Detection </li></ul></ul>Cont.
  20. 20. Paper - 3 <ul><li>System Model Overview: </li></ul>Cont.
  21. 21. Paper - 3 <ul><li>Online Sampling and Flow Aggregation: </li></ul><ul><ul><li>Traffic is sampled and classified into flows </li></ul></ul><ul><ul><ul><li>Flow is all connection with a particular Dest IP and Dest Port </li></ul></ul></ul><ul><ul><li>Connection records provide the following field: </li></ul></ul><ul><ul><ul><li>(SrcIP,SrcPort,DstIP,DstPort,ConnStatus,Duration) </li></ul></ul></ul><ul><ul><li>First identify the flow that packet belongs to </li></ul></ul><ul><ul><li>If not found: Generate a new flow ID based on connection information </li></ul></ul><ul><ul><li>If found: Corresponding flow array is updated </li></ul></ul>Cont.
  22. 22. Paper - 3 <ul><li>Clustering and Data Reduction </li></ul><ul><ul><li>Uses EM based clustering algorithm </li></ul></ul><ul><ul><li>Input: Dataset D and initial estimates </li></ul></ul><ul><ul><li>Output: Clustered set of Data Points </li></ul></ul><ul><ul><li>EM Algorithm has two steps: </li></ul></ul><ul><ul><ul><li>Expectation Step: Finds the expected value of complete data log likelihood </li></ul></ul></ul><ul><ul><ul><li>Maximization Step: Maximizes the expectations computed in the E step </li></ul></ul></ul>Cont.
  23. 23. Paper - 3 Cont. <ul><li>Clustering and Data Reduction </li></ul><ul><ul><li>Data Summaries for Data Reduction </li></ul></ul><ul><ul><ul><li>Execute to build summary for each time slice </li></ul></ul></ul><ul><ul><ul><li>This enhancement reduce dataset </li></ul></ul></ul><ul><ul><ul><li>Consists of 3 stages </li></ul></ul></ul><ul><ul><ul><li>1 st Stage: At the end of each time slice they have made a pass over the connection dataset and built data summaries. </li></ul></ul></ul><ul><ul><ul><li>2 nd Stage: They made another pass over the dataset to built cluster candidates using the data summaries collected in the previous step. After the frequently appearing data have been identified, the algorithm builds all cluster candidates during this pass. </li></ul></ul></ul><ul><ul><ul><li>3 rd Stage: In this stage clusters are selected from the set of candidates. </li></ul></ul></ul>
  24. 24. Paper - 3 Cont. <ul><li>Clustering and Data Reduction </li></ul><ul><ul><li>Handling missing data in a dataset </li></ul></ul><ul><ul><ul><li>At E step the expected value is computed </li></ul></ul></ul><ul><ul><ul><li>At M step that expected value is inserted into the dataset </li></ul></ul></ul><ul><li>Anomaly Detection: </li></ul><ul><ul><li>Anomalous distribution A </li></ul></ul><ul><ul><li>Normal distribution N </li></ul></ul><ul><ul><li>Determine which group the traffic belongs to </li></ul></ul><ul><ul><ul><li>calculated the likelihood of the two cases to determine the distribution to which the incoming flow belongs to. </li></ul></ul></ul><ul><ul><ul><li>Statistical functions </li></ul></ul></ul>
  25. 25. Paper - 4 <ul><li>ADAM: Detecting Intrusion by Data Mining </li></ul><ul><ul><li>Audit Data Analysis and Mining </li></ul></ul><ul><ul><li>Combination of Association Rule and Classification Rule </li></ul></ul><ul><ul><li>Firstly, ADAM collects known frequent datasets </li></ul></ul><ul><ul><li>Secondly, ADAM runs an online algorithm </li></ul></ul><ul><ul><ul><li>Finds last frequent connection records </li></ul></ul></ul><ul><ul><ul><li>Compare them with known mined data </li></ul></ul></ul><ul><ul><ul><li>Discards those, which seems to be normal </li></ul></ul></ul><ul><ul><ul><li>Suspicious ones are forwarded to the classifier </li></ul></ul></ul><ul><ul><ul><li>Trained classifier then classify the suspicious data as one of the following: </li></ul></ul></ul><ul><ul><ul><ul><li>Known type of attack </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Unknown type of attack </li></ul></ul></ul></ul><ul><ul><ul><ul><li>False alarm </li></ul></ul></ul></ul>
  26. 26. Paper - 4 <ul><li>ADAM has two phases in their model </li></ul><ul><li>1 st Phase: Train the classifier </li></ul><ul><ul><li>Offline process </li></ul></ul><ul><ul><li>Takes place only once </li></ul></ul><ul><ul><li>Before the main experiment </li></ul></ul><ul><li>2 nd Phase: Using the trained classifier </li></ul><ul><ul><li>Trained classifier is then used to detect anomalies </li></ul></ul><ul><ul><li>Online process </li></ul></ul>Cont.
  27. 27. Paper - 4 <ul><li>Phase 1: </li></ul>Cont.
  28. 28. Paper - 4 <ul><li>Phase 2: </li></ul>Cont.
  29. 29. Testing Methodology <ul><li>Paper 1: </li></ul><ul><ul><li>In this paper the authors did not use any testing methodology. They described different kinds of data mining techniques and rules to implement in various kinds of data mining based IDS. </li></ul></ul><ul><li>Paper 2: </li></ul><ul><ul><li>The authors of this paper used MIT Lincoln Lab 1999 intrusion detection evaluation (IDEVAL) data sets. </li></ul></ul><ul><ul><li>From this data set they over-sampled and created a new data set that contained a mix of 43.6% attack sessions and 56.4% non-attack sessions. </li></ul></ul><ul><ul><li>From both of data set they took random sampling and allocated 67% of the observations to training data set and 33% to the validation data set. </li></ul></ul>
  30. 30. Testing Methodology Cont. <ul><li>Paper 2: (Continued) </li></ul><ul><ul><li>Packet-Based Results </li></ul></ul><ul><ul><ul><li>They scored the UCF data with the packet-based network model and found that approximately 2.5% of the packets having a probability of 1.0000 of being an attack packet. </li></ul></ul></ul><ul><ul><ul><li>Conversely, a packet that scored a probability of 0.0000 does not necessarily mean that packet is a “good” packet and poses no threat to their campus networks. </li></ul></ul></ul><ul><ul><ul><li>The following figure shows that more than 70% of the packets captured have an attack probability of 0.0000 and 97% of the packets have an attack probability of 0.5000 or less. </li></ul></ul></ul>
  31. 31. Testing Methodology <ul><li>Paper 2: (Continued) </li></ul><ul><ul><li>Packet-Based Results (Continued) </li></ul></ul><ul><ul><ul><li>Overall, out of the approximately 500,000 packets with a 1.0000 probability, there are at least 50,000 packets that require further study. Retraining of their model and readjusting the model’s prior probabilities will allow to see if those remaining packets are truly attack packets or just simply false alarms. </li></ul></ul></ul>Cont.
  32. 32. Testing Methodology Cont. <ul><li>Paper 2: (Continued) </li></ul><ul><ul><li>Session-Based Results </li></ul></ul><ul><ul><ul><li>They also scored the UCF data with the session-based network model and found that approximately 32.9% of the sessions were identified as having a probability of 1.0000 of being an attack session. </li></ul></ul></ul><ul><ul><ul><li>Conversely, a session that scored a probability lower than 1.0000 does not also necessarily mean that session is a “good” session and poses no threat to their campus networks. </li></ul></ul></ul><ul><ul><ul><li>The vast majority of the sessions captured had a low or non-existent probability of being an attack session. Their studies showed that more than 66% of the sessions captured have an attack probability of 0.0129. </li></ul></ul></ul>
  33. 33. Testing Methodology Cont. <ul><li>Paper 3: </li></ul><ul><ul><li>The authors of this paper evaluated SCAN using the 1999 DARPA intrusion detection evaluation data. </li></ul></ul><ul><ul><li>The dataset consists of five weeks of TCPDump data. </li></ul></ul><ul><ul><li>Data from week 1 and 3 consist of normal attack-free network traffic. </li></ul></ul><ul><ul><li>Week 2 data consists of network traffic with labeled attacks. </li></ul></ul><ul><ul><li>The week 4 and 5 data are the “Test Data” and contain 201 instances of 58 different unlabelled attacks, 177 of which are visible in the inside TCPDump data. </li></ul></ul><ul><ul><li>They trained SCAN on the DARPA dataset using week 1 and 3 data, then evaluated the detector on weeks 4 and 5. </li></ul></ul><ul><ul><li>Evaluation was done in an off-line manner. </li></ul></ul>
  34. 34. Testing Methodology Cont. <ul><li>Paper 3: (Continued) </li></ul><ul><ul><li>Simulation Results with Complete Audit Data </li></ul></ul><ul><ul><ul><li>An IDS is evaluated on the basis of accuracy and efficiency. To judge the efficiency and accuracy of SCAN, they used Receiver-Operating Characteristic (ROC) curves. </li></ul></ul></ul><ul><ul><ul><li>An ROC curve, which graphs the rate of detection versus the false positive ratio. </li></ul></ul></ul><ul><ul><ul><li>The performance of SCAN at detecting a SSH Process Table attack is shown in Fig. 5. </li></ul></ul></ul><ul><ul><ul><li>The attack is similar to the Process Table attack in that the goal of the attacker is to cause the SSHD daemon to spawn the maximum number of processes that the operating system will allow. </li></ul></ul></ul>
  35. 35. Testing Methodology Cont. <ul><li>Paper 3: (Continued) </li></ul><ul><ul><li>Simulation Results with Complete Audit Data (Continued) </li></ul></ul><ul><ul><ul><li>In Fig. 6 the performance of SCAN at detecting a SYN flood attack is evaluated. A SYN flood attack is a type of a DoS attack. </li></ul></ul></ul><ul><ul><ul><li>This causes the data structure in the ‘tcpd’ daemon in the server to overflow. </li></ul></ul></ul><ul><ul><ul><li>They compared the performance of SCAN with that of the K-Nearest Neighbors (KNN) clustering algorithm. </li></ul></ul></ul><ul><ul><ul><li>Comparisons of the ROC curves seen in Fig. 5 and 6 suggests that SCAN outperforms the KNN algorithm. </li></ul></ul></ul><ul><ul><ul><li>The two techniques are similar to each other in that both are unsupervised techniques that work with unlabelled data. </li></ul></ul></ul>
  36. 36. Testing Methodology Cont. <ul><li>Paper 4: </li></ul><ul><ul><li>The authors of this paper discussed that ADAM participated in DARPA 1999 intrusion detection evaluation. </li></ul></ul><ul><ul><li>It focused on detecting DOS and PROBE attacks from tcpdump data and performed quite well. </li></ul></ul><ul><ul><li>The following Figures 3 and 4 show the results of DARPA 1999 test data. </li></ul></ul>
  37. 37. Testing Methodology Cont. <ul><li>Paper 4: (Continued) </li></ul>
  38. 38. Conclusion <ul><li>In this report we have studied the details of four papers in this area. </li></ul><ul><li>We have tried to make summary of those four papers, their system models, their technologies and their validation methods. </li></ul><ul><li>We did not go through all the cross-references given in those papers rather we kept the scope of this paper limited into these four papers only. </li></ul><ul><li>We strongly believe that this paper will be able to give the reader a overview on currently development in this area and how data mining is evolving into the field of network intrusion detection. </li></ul>
  39. 39. Bibliography <ul><li>[1] Chang-Tien Lu, Arnold P. Boedihardjo, Prajwal Manalwar, “Exploiting Efficient Data Mining Techniques to Enhance Intrusion Detection Systems. Information Reuse and Integration, Conf, 2005. IRI -2005 IEEE International Conference on. </li></ul><ul><li>[2] Caulkins, B.D.; Joohan Lee; Wang, M, “Packet- vs. session-based modeling for intrusion detection system”, Information Technology: Coding and Computing, 2005. ITCC 2005. International Conference on Volume 1,  4-6 April 2005 Page(s):116 - 121 Vol. 1 Digital Object Identifier 10.1109/ITCC.2005.222 . </li></ul><ul><li>[3] Patcha, A.; Park, J.-M., “Detecting denial-of-service attacks with incomplete audit data”, Computer Communications and Networks, 2005. ICCCN 2005. Proceedings. 14th International Conference on 17-19 Oct. 2005 Page(s):263 - 268 Digital Object Identifier 10.1109/ICCCN.2005.1523864. </li></ul><ul><li>[4] Daniel Barbara, Julia Couto, Sushil Jajodia, Leonard Popyack, Ningning Wu, “ADAM: Detecting Intrusions by Data Mining”, Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001. </li></ul>
  40. 40. Questions