
  1. Data Mining Based Intrusion Detection System
     Krishna C, Surendra Babu
  2. Papers:
     - A Data Mining Framework for Building Intrusion Detection Models
       (Wenke Lee, Salvatore J. Stolfo; research supported in part by grants from DARPA)
     - Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g
  3. Intrusion Detection Systems
     Intrusion detection techniques:
     - Anomaly detection
     - Misuse detection, covering attack categories such as:
       - DOS (denial of service)
       - Probing
       - Unauthorized access to local superuser privileges (U2R)
       - Unauthorized access from a remote machine (R2L)
  4. Requirements:
     - Reliable
     - Extensible
     - Easy to manage
     - Low maintenance cost
  5. A Data Mining Framework for Building Intrusion Detection Models
     - Data mining: extracting or mining knowledge from large amounts of data.
     - Data warehouse: a repository of information collected from multiple sources.
  6. Why data mining?
     - The dataset is large.
     - Constructing an IDS manually is expensive and slow.
     - Updates are frequent, since new intrusions appear frequently.
  7. Challenges for data mining in building an IDS:
     - Develop techniques to automate the knowledge-intensive process of feature selection.
     - Customize general algorithms to incorporate domain knowledge, so that only relevant patterns are reported.
     - Compute detection models that are both accurate and efficient at run time.
  8. Mining the data
     - Dataset types:
       - Network-based datasets
       - Host-based datasets
     - Build the IDS by mining these records.
     - When an attack is detected, raise an alarm to the administration system.
  9. Framework for building an IDS
     - Preprocessing: summarize the raw data.
     - Association rule mining.
     - Find sequential patterns (frequent episodes) based on the association rules.
     - Construct new features based on the sequential patterns.
     - Construct classifiers on the different sets of features.
  10. Preprocessing
      - Summarize raw data into high-level events, e.g. network connections with time, duration, service, and source/destination hosts.
      - Packet-filtering tools such as Bro and NFR can be used.
  11. Classification
      - Classify each audit record into one of a discrete set of categories: normal, or a particular kind of intrusion.
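As a toy illustration of this classification step, a hand-written rule set over assumed record fields might look like the sketch below. The field names and thresholds are invented for illustration; in the paper such rules are induced automatically by a learner (RIPPER) rather than written by hand.

```python
# Toy classification step: map each audit record to "normal" or a
# particular attack class. Field names and thresholds are illustrative;
# a real system learns these rules from labeled audit data.
def classify(record):
    if record["failed_logins"] > 3:
        return "guess_passwd"   # R2L-style: repeated failed logins
    if record["syn_count"] > 100 and record["ack_count"] == 0:
        return "syn_flood"      # DOS-style: many SYNs, no ACKs back
    return "normal"

print(classify({"failed_logins": 5, "syn_count": 0, "ack_count": 1}))
```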
  12. Association rule mining
      - Searches for interesting relationships among attributes in a given data set, i.e. derives multi-feature (attribute) correlations from a database table.
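A minimal sketch of association rule mining over audit records, using only the Python standard library (not the paper's implementation). Records are modeled as sets of attribute=value items, and rules X -> Y are reported when they meet support and confidence thresholds; the example records are invented.

```python
# Minimal association-rule mining sketch: find single-item rules
# X -> Y whose support and confidence exceed given thresholds.
from itertools import combinations

def mine_rules(records, min_support=0.5, min_confidence=0.8):
    """records: list of sets of attribute=value items."""
    n = len(records)

    def support(itemset):
        return sum(1 for r in records if itemset <= r) / n

    items = set().union(*records)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y})
            if s >= min_support:
                conf = s / support({x})
                if conf >= min_confidence:
                    rules.append((x, y, s, conf))
    return rules

# Toy audit records: each is a set of attribute=value items.
records = [
    {"service=http", "flag=SF", "dst=web1"},
    {"service=http", "flag=SF", "dst=web2"},
    {"service=http", "flag=SF", "dst=web1"},
    {"service=ftp", "flag=REJ", "dst=web1"},
]
for x, y, s, c in mine_rules(records):
    print(f"{x} -> {y}  (support={s:.2f}, confidence={c:.2f})")
```

On this toy data the only rules that survive both thresholds link `service=http` and `flag=SF` in either direction, each with confidence 1.0.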
  13. Sequence pattern mining: frequent episodes
      - X, Y -> Z  [c, s, w]
      - Given occurrences of itemsets X and Y, Z will occur within the time window w (with confidence c and support s).
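The episode notation above can be made concrete with a small sliding-window sketch: count the fraction of width-w windows in which X and Y appear and Z follows them. This is a simplified illustration of the support of an episode, with invented event names; it is not the paper's frequent-episode algorithm.

```python
# Sliding-window sketch of frequent-episode support: the fraction
# of width-w windows where events x and y occur and z follows both.
def episode_frequency(events, x, y, z, w):
    n = len(events)
    hits = total = 0
    for start in range(n - w + 1):
        window = events[start:start + w]
        total += 1
        if x in window and y in window:
            # Look for z after the later of x's and y's first occurrence.
            later = window[max(window.index(x), window.index(y)) + 1:]
            if z in later:
                hits += 1
    return hits / total if total else 0.0

print(episode_frequency(["X", "Y", "Z", "A", "X", "Y", "Z"], "X", "Y", "Z", 3))
```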
  14. Feature construction
      - Feature extraction is the process of determining which evidence, drawn from raw audit data, is most useful for analysis.
      - Construct new features according to the frequent episodes:
        - Some features show close relationships to each other; combine those features.
        - Some frequent episodes may suggest interesting new features.
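A typical feature constructed this way is a count over recent records, e.g. "number of connections to the same destination host among the last w records." The sketch below derives such a feature; the field name `dst` and window size are assumptions for illustration.

```python
# Feature construction sketch: for each connection record, add a
# count of connections to the same destination among the previous
# w records. Field names are illustrative.
def add_count_feature(connections, w=3):
    out = []
    for i, conn in enumerate(connections):
        recent = connections[max(0, i - w):i]
        count = sum(1 for c in recent if c["dst"] == conn["dst"])
        out.append({**conn, "same_dst_count": count})
    return out

conns = [{"dst": "a"}, {"dst": "a"}, {"dst": "b"}, {"dst": "a"}]
print([c["same_dst_count"] for c in add_count_feature(conns)])
```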
  15. Build models (classifiers)
      - Build different classifiers for different attacks.
  16. Experiments: the DARPA data
      - 4 GB of compressed tcpdump data from 7 weeks of network traffic.
      - Contains 4 main categories of attacks:
        - DOS: denial of service, e.g. ping-of-death, SYN flood
        - R2L: unauthorized access from a remote machine, e.g. password guessing
        - U2R: unauthorized access to local superuser privileges by a local unprivileged user, e.g. buffer overflow
        - PROBING: e.g. port scan, ping sweep
  17. Results
      - Training on the 7 weeks of labeled data, testing on the 2 weeks of unlabeled data.
      - The test data contains 14 attack types that do not appear in the training data.
      - Four systems are compared:
        - Columbia: the IDS developed with the framework introduced above.
        - Groups 1-3: three systems developed with knowledge-engineering approaches.
  18. Results
      - Detection rate on new and old attacks:
        - Old attacks: attack types that occur in both the training and testing data.
        - New attacks: attack types that occur in the testing data only.
  19. Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g
      DAID: a database-centric architecture that leverages data mining within the Oracle RDBMS to address these challenges, providing:
      - Scheduling capabilities
      - Alert infrastructure
      - Data analysis tools
      - Security
      - Scalability
      - Reliability
  20. Requirements for a production-quality IDS:
      - Centralized view of the data
      - Data transformation capabilities
      - Analytic and data mining methods
      - Flexible detector deployment, including scheduling that enables periodic model creation and distribution
      - Real-time detection and alert infrastructure
      - Reporting capabilities
      - Distributed processing
      - High system availability
      - Scalability with system load
  21. DAID components:
      - Sensors
      - Extraction, transformation and load (ETL)
      - Centralized data warehousing
      - Automated model generation
      - Automated model distribution
      - Real-time and offline detection
      - Reporting and analysis
      - Automated alerts
  22. Sensors
      - Collect audit information:
        - Network traffic data
        - System logs on individual hosts
        - System calls made by processes
  23. ETL
      - Used for preprocessing audit streams and for feature extraction.
      - Uses SQL and user-defined functions to extract key pieces of information.
        - Example: a windowing analytic function that computes the number of HTTP connections to a given host.
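A query of the kind that example describes can be sketched with a standard SQL window function. For illustration it is run here against an in-memory SQLite database (window functions require SQLite 3.25+); the table and column names are assumptions, though Oracle's analytic-function syntax is very similar.

```python
# Sketch of an ETL-style analytic query: count HTTP connections to
# each destination host over a trailing window of audit rows.
# Table and column names are illustrative, not Oracle's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (ts INTEGER, service TEXT, dst_host TEXT)")
conn.executemany("INSERT INTO audit VALUES (?, ?, ?)", [
    (1, "http", "web1"), (2, "http", "web1"),
    (3, "ftp",  "web1"), (4, "http", "web2"),
])
rows = conn.execute("""
    SELECT ts, dst_host,
           COUNT(CASE WHEN service = 'http' THEN 1 END)
               OVER (PARTITION BY dst_host ORDER BY ts
                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
               AS http_count
    FROM audit ORDER BY ts
""").fetchall()
for r in rows:
    print(r)
```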
  24. Model generation
      Popular techniques for misuse and anomaly detection:
      - Association rules
      - Clustering
      - Support vector machines (supervised learning methods for classification)
      - Decision trees
  25. Model build functionality
      - The DBMS_DATA_MINING PL/SQL package is used to train linear SVM anomaly and misuse detection models.
      - Test dataset attack categories:
        - Probing
        - Denial of service
        - Unauthorized access to local superuser privileges (U2R)
        - Unauthorized access from a remote machine (R2L)
        - (37 subclasses of attacks under the 4 generic categories)
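Oracle's PL/SQL model-build calls cannot be shown runnably here, so the sketch below uses a plain perceptron as a stand-in for training a linear classifier on labeled audit vectors. The perceptron is not an SVM and is not Oracle's implementation; the toy feature vectors and labels are invented.

```python
# Stand-in for linear-model training: a perceptron on toy audit
# feature vectors, labeled +1 (attack) / -1 (normal). Illustrative
# only; the slides' system trains linear SVMs via DBMS_DATA_MINING.
def train_perceptron(samples, labels, epochs=10, lr=1.0):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy, linearly separable data: (syn_rate, ack_rate) per record.
samples = [(2.0, 0.0), (1.5, 0.5), (0.0, 2.0), (0.5, 1.5)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])
```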
  26. Misuse detection and anomaly detection problems
      - Accuracy of the system: 92.1%
  27. Model maintenance
      - Periodic model updates as new data accumulates.
      - Model rebuild when performance falls below a predefined level.
  28. Model distribution
      - Real Application Clusters (RAC)
  29. Detection
      - Real-time or offline.
      - Audit data are classified as attack or not by the misuse detection SVM model.
  30. - A functional index on the probability of a case being an attack.
      - The query returns all cases in audit_data with probability greater than 0.5 of being an attack.
  31. Combination of multiple models
      - The query returns all cases where either model1 or model2 indicates an attack with probability higher than 0.4.
      - In this case, when the anomaly_model classifies a case as an attack with probability greater than 0.5, the misuse_model attempts to identify the type of attack.
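The two combined-model queries described above can be sketched in plain Python rather than SQL: flag a case when either model's attack probability exceeds a threshold, and when the anomaly model fires, ask the misuse model for the attack type. The model functions here are stand-ins, not Oracle's actual API.

```python
# Combining two detectors, as described on this slide. Models are
# represented as callables returning an attack probability or type.
def combined_alerts(cases, model1, model2, threshold=0.4):
    """Cases where either model scores above the threshold."""
    return [c for c in cases
            if model1(c) > threshold or model2(c) > threshold]

def typed_alerts(cases, anomaly_model, misuse_model, threshold=0.5):
    """(case, attack_type) pairs for cases the anomaly model flags."""
    return [(c, misuse_model(c)) for c in cases
            if anomaly_model(c) > threshold]

cases = [{"id": 1, "p": 0.9}, {"id": 2, "p": 0.3}]
prob = lambda c: c["p"]          # stand-in anomaly/misuse probability
print(combined_alerts(cases, prob, lambda c: 0.0))
print(typed_alerts(cases, prob, lambda c: "syn_flood"))
```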
  32. Reports and analysis
  33. Conclusion
      - Data mining techniques are very useful in intrusion detection.
      - Some processing steps still need manual interpretation and expert advice.
      - These systems are more effective on known attacks than on unknown attacks; unknown attacks are caught only if the training data captures all normal behavior.