ppt slides



  1. 1. Areej Al-Bataineh
  2. 2. <ul><li>Data Mining Basics </li></ul><ul><ul><li>Definition </li></ul></ul><ul><ul><li>Some techniques </li></ul></ul><ul><ul><ul><li>Association Rules </li></ul></ul></ul><ul><ul><ul><li>Classification </li></ul></ul></ul><ul><ul><ul><li>Clustering </li></ul></ul></ul><ul><li>Data mining meets Intrusion Detection </li></ul><ul><ul><li>Detection Approaches </li></ul></ul><ul><ul><li>Data mining use in IDS </li></ul></ul><ul><ul><li>Case Study </li></ul></ul><ul><ul><ul><li>Behavioral Features for Network Anomaly Detection </li></ul></ul></ul><ul><ul><li>Conclusions </li></ul></ul>05/10/10 Data Mining in Intrusion Detection
  3. 3. <ul><li>Knowledge Discovery in Databases (KDD) </li></ul><ul><ul><li>“Process of extracting useful information from large databases” </li></ul></ul><ul><li>KDD basic steps </li></ul><ul><ul><li>Understanding the application domain </li></ul></ul><ul><ul><li>Data integration and selection </li></ul></ul><ul><ul><li>Data mining </li></ul></ul><ul><ul><li>Pattern evaluation </li></ul></ul><ul><ul><li>Knowledge representation </li></ul></ul><ul><li>Related fields </li></ul><ul><ul><li>Machine learning, statistics, others </li></ul></ul>
  4. 4. <ul><li>“Concerned with uncovering patterns, associations, changes, anomalies, and statistically significant structures and events in data” </li></ul><ul><li>Why data mining? </li></ul><ul><ul><li>Understand existing data </li></ul></ul><ul><ul><li>Predict new data </li></ul></ul><ul><li>Components </li></ul><ul><ul><li>Representation </li></ul></ul><ul><ul><ul><li>Decide what kind of model we can build. </li></ul></ul></ul><ul><ul><ul><li>A model is a compact summary of the examples. </li></ul></ul></ul><ul><ul><li>Learning element </li></ul></ul><ul><ul><ul><li>Builds a model from a set of examples </li></ul></ul></ul><ul><ul><li>Performance element </li></ul></ul><ul><ul><ul><li>Applies the model to new observations </li></ul></ul></ul>
  5. 5. <ul><li>Well-known techniques used in intrusion detection </li></ul><ul><ul><li>Association Rules [Descriptive] </li></ul></ul><ul><ul><li>Classification [Predictive] </li></ul></ul><ul><ul><li>Clustering [Descriptive] </li></ul></ul><ul><li>Preliminary step </li></ul><ul><ul><li>Raw Data -> Database Table (training set) </li></ul></ul><ul><ul><li>Columns - Attributes </li></ul></ul><ul><ul><li>Rows - Records </li></ul></ul>
  6. 6. <ul><li>Motivated by market-basket analysis </li></ul><ul><li>Generate rules that capture implications between attribute values </li></ul><ul><li>Rule example </li></ul><ul><ul><li>Lettuce & Tomato -> Salad Dressing [0.4, 0.9] </li></ul></ul><ul><li>Parameters [s, c] </li></ul><ul><ul><li>Support (s) = fraction of records that satisfy both LHS and RHS </li></ul></ul><ul><ul><li>Confidence (c) = P(satisfies RHS | satisfies LHS) </li></ul></ul><ul><li>Mining problem </li></ul><ul><ul><li>“Find all association rules that have support and confidence greater than user-defined minimum values” </li></ul></ul>
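The support and confidence parameters defined on this slide can be computed directly from a table of records. The following is a minimal sketch, assuming each record is a set of items; the function names and the basket data are invented for illustration and do not reproduce the [0.4, 0.9] figures of the slide's example.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(record) for record in records) / len(records)


def confidence(records, lhs, rhs):
    """Estimate P(RHS | LHS) as support(LHS and RHS) / support(LHS)."""
    lhs_support = support(records, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(records, set(lhs) | set(rhs)) / lhs_support


# Hypothetical market baskets (not taken from the slides).
baskets = [
    {"lettuce", "tomato", "salad dressing"},
    {"lettuce", "tomato", "salad dressing", "bread"},
    {"lettuce", "tomato"},
    {"bread", "milk"},
    {"milk"},
]

lhs, rhs = {"lettuce", "tomato"}, {"salad dressing"}
s = support(baskets, lhs | rhs)   # rule support
c = confidence(baskets, lhs, rhs)  # rule confidence
print(f"Lettuce & Tomato -> Salad Dressing [{s:.2f}, {c:.2f}]")
```

A rule miner would keep only the rules whose `s` and `c` both exceed the user-defined minimums.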
  7. 7. <ul><li>Predefined set of classes </li></ul><ul><li>Training set has the class as one of its attributes </li></ul><ul><ul><li>Supervised learning </li></ul></ul><ul><li>Mining problem </li></ul><ul><ul><li>“Find a model for the class attribute as a function of the values of the other attributes” </li></ul></ul><ul><li>Use the model to predict the class of new records </li></ul><ul><li>Classifier representation </li></ul><ul><ul><li>If-then rules </li></ul></ul><ul><ul><li>Decision trees </li></ul></ul>
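To make the supervised-learning setup concrete, here is a small sketch of a one-level decision tree (a "decision stump"), which is also a single if-then rule. It is a deliberately simplified stand-in for a full decision-tree learner; the `failed_logins` attribute, the class labels, and the training records are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most frequent class label in a list."""
    return Counter(labels).most_common(1)[0][0]


def learn_stump(records, attribute, class_attr="class"):
    """Learn a one-level tree: the threshold on `attribute` with the fewest training errors."""
    best = None
    for threshold in sorted({r[attribute] for r in records}):
        left = [r[class_attr] for r in records if r[attribute] <= threshold]
        right = [r[class_attr] for r in records if r[attribute] > threshold]
        if not right:
            continue  # a split must leave records on both sides
        left_label, right_label = majority(left), majority(right)
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("attribute has a single value; nothing to split on")
    return {"attribute": attribute, "threshold": best[1],
            "if_le": best[2], "if_gt": best[3]}


def predict(stump, record):
    """Apply the learned if-then rule to a new record."""
    if record[stump["attribute"]] <= stump["threshold"]:
        return stump["if_le"]
    return stump["if_gt"]


# Hypothetical training set: the class is one of the attributes.
training = [
    {"failed_logins": 0, "class": "normal"},
    {"failed_logins": 1, "class": "normal"},
    {"failed_logins": 0, "class": "normal"},
    {"failed_logins": 7, "class": "attack"},
    {"failed_logins": 9, "class": "attack"},
    {"failed_logins": 6, "class": "attack"},
]
stump = learn_stump(training, "failed_logins")
print(stump, predict(stump, {"failed_logins": 8}))
```

A real decision-tree learner would apply this kind of split recursively and choose among several attributes.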
  8. 8. <ul><li>Given a data set and a similarity measure </li></ul><ul><ul><li>Unsupervised learning </li></ul></ul><ul><li>Mining problem </li></ul><ul><ul><li>“Group records into clusters such that records within a cluster are more similar to one another, and records in separate clusters are less similar to one another” </li></ul></ul><ul><li>Similarity measures: </li></ul><ul><ul><li>Euclidean distance if attributes are continuous. </li></ul></ul><ul><ul><li>Other problem-specific measures. </li></ul></ul><ul><li>Clustering methods </li></ul><ul><ul><li>Partitioning </li></ul></ul><ul><ul><ul><li>Divide data into disjoint partitions </li></ul></ul></ul><ul><ul><li>Hierarchical </li></ul></ul><ul><ul><ul><li>Root is the complete data set, leaves are individual records, and intermediate layers are partitions </li></ul></ul></ul>
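As an illustration of the partitioning approach with Euclidean distance, the sketch below implements plain k-means. The slide does not name a specific algorithm, so k-means is an assumption here, as are the sample points and the deterministic choice of the first `k` points as initial centroids.

```python
import math


def kmeans(points, k, iterations=20):
    """Partition `points` into k disjoint clusters using Euclidean distance."""
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                # Centroid = coordinate-wise mean of the cluster members.
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-attribute records forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

Random seeding and several restarts are common in practice; the fixed seeding only keeps the example reproducible.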
  9. 9. <ul><li>Detection approach </li></ul><ul><ul><li>Misuse detection </li></ul></ul><ul><ul><ul><li>Based on known malicious patterns (signatures) </li></ul></ul></ul><ul><ul><li>Anomaly detection </li></ul></ul><ul><ul><ul><li>Based on deviations from established normal patterns (profiles) </li></ul></ul></ul><ul><li>Data source </li></ul><ul><ul><li>Network-based (NIDS) </li></ul></ul><ul><ul><ul><li>Network traffic </li></ul></ul></ul><ul><ul><li>Host-based (HIDS) </li></ul></ul><ul><ul><ul><li>Audit trails </li></ul></ul></ul>
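The difference between the two detection approaches can be shown in a few lines. This is only a toy sketch: the signature strings, the measured quantity, and the three-standard-deviation threshold are all assumptions made for illustration, not part of the slides.

```python
import statistics

# Hypothetical signatures of known malicious payloads.
SIGNATURES = [b"/etc/passwd", b"cmd.exe"]


def misuse_alert(payload):
    """Misuse detection: alert when the payload contains a known signature."""
    return any(signature in payload for signature in SIGNATURES)


def build_profile(normal_values):
    """Anomaly detection, step 1: summarize normal behavior as (mean, std dev)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def anomaly_alert(value, profile, k=3.0):
    """Anomaly detection, step 2: alert when a value deviates more than k std devs."""
    mean, std = profile
    return abs(value - mean) > k * std


# Hypothetical profile: connections per minute observed during normal operation.
profile = build_profile([10, 12, 11, 9, 10, 11])
print(misuse_alert(b"GET /../../etc/passwd"), anomaly_alert(50, profile))
```

The misuse check can only find attacks it has a signature for, while the anomaly check can flag unseen behavior at the cost of false alarms when normal traffic drifts.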
  10. 10. <ul><li>Signature extraction </li></ul><ul><li>Rule matching </li></ul><ul><li>Alarm data analysis </li></ul><ul><ul><li>Reduce false alarms </li></ul></ul><ul><ul><li>Eliminate redundant alarms </li></ul></ul><ul><li>Feature selection </li></ul><ul><li>Training Data cleaning </li></ul>
  11. 11. <ul><li>Behavioral Features for Network Anomaly Detection </li></ul><ul><ul><li>Training set = normal network traffic </li></ul></ul><ul><ul><li>Features provide the semantics of the data values </li></ul></ul><ul><ul><li>Feature selection is important </li></ul></ul><ul><ul><li>Proposed method: </li></ul></ul><ul><ul><ul><li>Feature extraction based on protocol behavior </li></ul></ul></ul><ul><ul><ul><li>Many attacks use protocols improperly </li></ul></ul></ul><ul><ul><ul><ul><li>Ping of Death </li></ul></ul></ul></ul><ul><ul><ul><ul><li>SYN Flood </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Teardrop </li></ul></ul></ul></ul>
  12. 12. <ul><li>Attributes </li></ul><ul><ul><li>packet header fields </li></ul></ul><ul><li>Feature </li></ul><ul><ul><li>Single or multiple attributes </li></ul></ul><ul><li>Protocol Specifications </li></ul><ul><ul><li>Policy for interaction </li></ul></ul><ul><ul><li>Define attributes and the range of values </li></ul></ul><ul><li>Flow </li></ul><ul><ul><li>Collection of packets exchanged between entities engaged in protocol </li></ul></ul><ul><ul><li>Client/Server flows </li></ul></ul>
  13. 13. <ul><li>Inter-Flow vs Intra-Flow Analysis (IVIA) </li></ul><ul><li>First step </li></ul><ul><ul><li>Identify attributes used in partitioning traffic data into flows -> Src/Dst ports </li></ul></ul><ul><ul><li>Result: HTTP flows, DNS flows, etc. </li></ul></ul><ul><li>Next step </li></ul><ul><ul><li>Examine change of attribute values </li></ul></ul><ul><ul><ul><li>Between flows (inter-flow) </li></ul></ul></ul><ul><ul><ul><li>Within a flow (intra-flow) </li></ul></ul></ul><ul><li>Results </li></ul><ul><ul><li>Operationally Variable Attributes (OVAs) </li></ul></ul><ul><ul><li>Flow Descriptors </li></ul></ul><ul><ul><li>Operationally Invariant attributes </li></ul></ul><table><tr><th>Inter-flow changes</th><th>Intra-flow changes</th><th>IP header attributes</th><th>Category</th></tr><tr><td>Yes</td><td>Yes</td><td>IHL, Service Type, Total Length, Identification, Flags_DF, Flags_MF, Fragment Offset, Time to Live, Options</td><td>Operationally Variable Attributes</td></tr><tr><td>Yes</td><td>No</td><td>Source Address, Destination Address, Protocol</td><td>Flow Descriptors</td></tr><tr><td>No</td><td>No</td><td>Version, Flags_reserved</td><td>Operationally Invariant</td></tr></table>
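The two IVIA steps can be sketched as follows: partition packets into flows, then label each attribute by whether its value changes within a flow and across flows. This is an interpretation of the slide, not the authors' implementation. It assumes packets are dictionaries, that an attribute "changes between flows" when it takes more than one value over the whole trace, and it uses made-up field names and sample packets.

```python
from collections import defaultdict


def classify_attributes(packets, flow_key, attributes):
    """Label each attribute by its intra-flow and inter-flow variability."""
    # Step 1: partition traffic into flows using the flow-key attributes.
    flows = defaultdict(list)
    for packet in packets:
        flows[tuple(packet[name] for name in flow_key)].append(packet)

    labels = {}
    for attr in attributes:
        # Step 2: collect the distinct values seen inside each flow.
        per_flow = [{p[attr] for p in flow} for flow in flows.values()]
        intra = any(len(values) > 1 for values in per_flow)
        inter = len(set().union(*per_flow)) > 1  # assumption: varies anywhere in the trace
        if intra:
            labels[attr] = "operationally variable"
        elif inter:
            labels[attr] = "flow descriptor"
        else:
            labels[attr] = "operationally invariant"
    return labels


# Hypothetical packets: two flows keyed by source and destination port.
packets = [
    {"src_port": 1025, "dst_port": 80, "version": 4, "src_addr": "10.0.0.1", "ident": 100},
    {"src_port": 1025, "dst_port": 80, "version": 4, "src_addr": "10.0.0.1", "ident": 101},
    {"src_port": 2040, "dst_port": 53, "version": 4, "src_addr": "10.0.0.2", "ident": 7},
    {"src_port": 2040, "dst_port": 53, "version": 4, "src_addr": "10.0.0.2", "ident": 8},
]
labels = classify_attributes(packets, ("src_port", "dst_port"),
                             ["version", "src_addr", "ident"])
print(labels)
```

On this toy trace, `ident` changes within a flow, `src_addr` changes only between flows, and `version` never changes, which mirrors the three result categories on the slide.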
  14. 14. <ul><li>Uses 1999 DARPA IDS Evaluation data set </li></ul><ul><li>Build association rules for IP fragments using OVAs </li></ul><ul><li>Result - Top 8 ranking rules </li></ul><table><tr><th>Rule</th><th>Support</th><th>Strength</th></tr><tr><td>ipFlagsMF = 1 & ipTTL = 63 -> ipTLen = 28</td><td>0.526</td><td>0.981</td></tr><tr><td>ipID < 2817 & ipFlagsMF = 1 -> ipTLen > 28</td><td>0.309</td><td>0.968</td></tr><tr><td>ipID < 2817 & ipTTL > 63 -> ipTLen > 28</td><td>0.299</td><td>1.000</td></tr><tr><td>ipTLen > 28 -> ipID < 2817</td><td>0.309</td><td>1.000</td></tr><tr><td>ipID < 2817 -> ipTLen > 28</td><td>0.309</td><td>0.927</td></tr><tr><td>ipTTL > 63 -> ipTLen > 28</td><td>0.299</td><td>0.988</td></tr><tr><td>ipTLen > 28 -> ipTTL > 63</td><td>0.299</td><td>0.967</td></tr><tr><td>ipTLen > 28 & ipOffset > 118 -> ipTTL > 63</td><td>0.291</td><td>1.000</td></tr></table>
  15. 15. <ul><li>Transform OVAs into features that capture the protocol behavior </li></ul><ul><li>Behavior features </li></ul><ul><ul><li>Attribute observed over time/event </li></ul></ul><ul><li>For an attribute, observe </li></ul><ul><ul><li>Entropy </li></ul></ul><ul><ul><li>Mean and standard deviation </li></ul></ul><ul><ul><li>Percentage of events with a given value </li></ul></ul><ul><ul><li>Percentage of events that are monotonic </li></ul></ul><ul><ul><li>Step size in attribute value </li></ul></ul><ul><li>Training data requirements are reduced </li></ul><ul><li>Normal = acceptable uses of the protocol </li></ul>
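The behavior features listed on this slide are simple statistics over a sequence of attribute values. Below is a minimal sketch of how they might be computed; the exact definitions (for instance, treating "monotonic" as non-decreasing consecutive steps) and the sample IP ID and TTL sequences are assumptions.

```python
import math
import statistics
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the observed value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def percent_with_value(values, target):
    """Percentage of events whose attribute equals `target`."""
    return 100.0 * sum(v == target for v in values) / len(values)


def percent_monotonic(values):
    """Percentage of consecutive steps that do not decrease (assumed definition)."""
    steps = list(zip(values, values[1:]))
    return 100.0 * sum(b >= a for a, b in steps) / len(steps)


def mean_step(values):
    """Average step size between consecutive attribute values."""
    return statistics.mean(b - a for a, b in zip(values, values[1:]))


# Hypothetical observations of two attributes within one flow.
ip_ids = [100, 101, 102, 103, 104]
ttls = [64, 64, 64, 64, 64]

print(entropy(ip_ids), statistics.mean(ip_ids), statistics.stdev(ip_ids))
print(percent_with_value(ttls, 64), percent_monotonic(ip_ids), mean_step(ip_ids))
```

A counter-like attribute such as the IP ID shows high entropy, a constant step, and fully monotonic behavior, while a constant attribute such as the TTL shows zero entropy.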
  16. 16. <ul><li>Uses aggregate attribute values for some window of packets </li></ul><ul><ul><li>Window size = 10 </li></ul></ul><ul><ul><li>Examples </li></ul></ul><ul><ul><ul><li>TcpPerFIN = % of packets with FIN set </li></ul></ul></ul><ul><ul><ul><li>meanIAT = mean inter-arrival time </li></ul></ul></ul><ul><li>50 flows for each protocol = 250 flows </li></ul><ul><li>Number of packets per flow: 5 to 37000 </li></ul><ul><li>Use decision tree classifier (C5) </li></ul><ul><ul><ul><li>FTP, SSH, Telnet, SMTP, HTTP </li></ul></ul></ul><ul><li>Classifier tested on DARPA data set </li></ul><ul><ul><li>FTP 100%, SSH 100%, Telnet 100%, SMTP 82%, WWW 98% </li></ul></ul><ul><li>Real network traffic (85% - 100%) </li></ul><ul><ul><li>Kazaa 100% </li></ul></ul>
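The windowed aggregation described on this slide can be sketched as below for the two named features, `TcpPerFIN` and `meanIAT`. The packet representation (a timestamp plus a FIN flag), the use of non-overlapping windows, and the sample flow are assumptions; the slides do not say how windows are laid out.

```python
def window_features(packets, window_size=10):
    """Compute per-window features for one flow.

    `packets` is a list of (timestamp_seconds, fin_flag) tuples in arrival order.
    Assumes non-overlapping windows and window_size >= 2; a trailing partial
    window is ignored.
    """
    features = []
    for start in range(0, len(packets) - window_size + 1, window_size):
        window = packets[start:start + window_size]
        fin_count = sum(1 for _, fin in window if fin)
        gaps = [later[0] - earlier[0] for earlier, later in zip(window, window[1:])]
        features.append({
            "TcpPerFIN": 100.0 * fin_count / window_size,  # % of packets with FIN set
            "meanIAT": sum(gaps) / len(gaps),              # mean inter-arrival time
        })
    return features


# Hypothetical flow: 20 packets, 0.5 s apart, FIN set only on the last packet.
flow = [(i * 0.5, i == 19) for i in range(20)]
features = window_features(flow)
print(features)
```

Each resulting feature vector would then be one training record for the decision tree classifier, labeled with the flow's protocol.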
  17. 17. [Figure: decision tree diagram; only branch thresholds survived extraction, attribute labels are not recoverable]
  18. 18. <ul><li>Behavioral Features for Network Anomaly Detection </li></ul><ul><ul><li>Attribute values cannot be used as features </li></ul></ul><ul><ul><li>Interpretation of protocol specifications </li></ul></ul><ul><ul><li>Transform attributes into behavior features </li></ul></ul><ul><ul><li>Aggregation of the attribute values </li></ul></ul><ul><li>Data Mining Challenges </li></ul><ul><ul><li>Self-tuning data mining techniques </li></ul></ul><ul><ul><li>Pattern-finding and prior knowledge </li></ul></ul><ul><ul><li>Modeling of temporal data </li></ul></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Incremental mining </li></ul></ul>
  19. 19. <ul><li>Tools </li></ul><ul><ul><li>KDnuggets </li></ul></ul><ul><ul><ul><li>Web portal http://www.kdnuggets.com </li></ul></ul></ul><ul><ul><li>WEKA </li></ul></ul><ul><ul><ul><li>Most comprehensive and free collection of tools </li></ul></ul></ul><ul><ul><ul><li>http://www.cs.waikato.ac.nz/ml/weka </li></ul></ul></ul><ul><li>Data Sets </li></ul><ul><ul><li>Machine Learning Database Repository </li></ul></ul><ul><ul><li>Knowledge Discovery in Databases Archive </li></ul></ul><ul><ul><ul><li>http://kdd.ics.uci.edu </li></ul></ul></ul><ul><ul><li>MIT Lincoln Labs </li></ul></ul><ul><ul><ul><li>http://www.ll.mit.edu/IST/ideval </li></ul></ul></ul>
  20. 20. <ul><li>“Applications of Data Mining in Computer Security” by Barbara and Jajodia </li></ul><ul><li>“Machine Learning and Data Mining for Computer Security” by Maloof </li></ul><ul><li>“Data Mining: Challenges and Opportunities for Data Mining During the Next Decade” by Grossman </li></ul><ul><li>“Data Mining: Concepts and Techniques” by Han and Kamber </li></ul><ul><li>SANS IDS FAQs </li></ul><ul><ul><li>https://www2.sans.org/resources/idfaq/ </li></ul></ul><ul><li>ACM Crossroads: IDS </li></ul><ul><ul><li>http://www.acm.org/crossroads/xrds2-4/intrus.html </li></ul></ul>
  21. 21. <ul><li>Old </li></ul><ul><ul><li>Represent rules as a decision tree in memory </li></ul></ul><ul><ul><li>Very inefficient </li></ul></ul><ul><ul><li>Matching time is linear in the number of rules </li></ul></ul><ul><ul><li>Rule sets are growing fast </li></ul></ul><ul><li>New </li></ul><ul><ul><li>Multi-pattern search algorithm </li></ul></ul><ul><ul><li>Apply multiple rules in parallel </li></ul></ul><ul><ul><li>Set-wise methodology </li></ul></ul><ul><ul><li>Fire the rule with the longest match </li></ul></ul>
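The "new" approach on this slide matches a whole set of patterns in one pass and fires the rule with the longest match. The slide does not name the algorithm, so the sketch below uses a plain trie scan as an assumed stand-in; the rule identifiers, patterns, and input text are hypothetical.

```python
def build_trie(patterns):
    """Build a nested-dict trie; the key None marks a pattern end and stores its rule id."""
    root = {}
    for rule_id, pattern in patterns.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = rule_id
    return root


def longest_matches(trie, text):
    """At each start position, walk the trie once and keep only the longest match."""
    hits = []
    for start in range(len(text)):
        node, best = trie, None
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no pattern continues with this character
            if None in node:
                best = (start, end + 1, node[None])  # a longer match replaces a shorter one
        if best is not None:
            hits.append(best)
    return hits


# Hypothetical rule set: "cmd" is a prefix of "cmd.exe".
patterns = {"rule-1": "cmd", "rule-2": "cmd.exe", "rule-3": "passwd"}
trie = build_trie(patterns)
hits = longest_matches(trie, "GET /cmd.exe?/etc/passwd")
print(hits)
```

All patterns share one trie, so the cost per input position depends on the longest pattern rather than on the number of rules. Automaton-based methods such as Aho-Corasick avoid restarting the walk at every position, which this simplified sketch does not attempt.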