Data Mining: Introduction



  1. 1. Intrusion Detection
  2. 2. Outline <ul><li>Intrusion detection and computer security </li></ul><ul><li>Current intrusion detection approaches </li></ul><ul><li>Data Mining Approaches for Intrusion Detection </li></ul><ul><li>Summary </li></ul>
  3. 3. Intrusion Detection and Computer Security <ul><li>Computer security goals: </li></ul><ul><ul><li>Confidentiality, integrity, and availability </li></ul></ul><ul><li>An intrusion is a set of actions aimed at compromising these security goals </li></ul><ul><li>Intrusion prevention (authentication, encryption, etc.) alone is not sufficient </li></ul><ul><li>Intrusion detection is needed </li></ul>
  4. 4. Intrusion Examples <ul><li>Intrusions: any set of actions that threaten the integrity, availability, or confidentiality of a network resource </li></ul><ul><li>Examples </li></ul><ul><ul><li>Denial of service (DoS): attempts to starve a host of the resources it needs to function correctly </li></ul></ul><ul><ul><li>Scan: reconnaissance on the network or a particular host </li></ul></ul><ul><ul><li>Worms and viruses: replicating on other hosts </li></ul></ul><ul><ul><li>Compromises: obtaining privileged access to a host by exploiting known vulnerabilities </li></ul></ul>
  5. 5. Intrusion Detection <ul><li>Intrusion detection: The process of monitoring and analyzing the events occurring in a computer and/or network system in order to detect signs of security problems </li></ul><ul><li>Primary assumption : User and program activities can be monitored and modeled </li></ul><ul><li>Steps </li></ul><ul><ul><li>Monitoring and analyzing traffic </li></ul></ul><ul><ul><li>Identifying abnormal activities </li></ul></ul><ul><ul><li>Assessing severity and raising alarm </li></ul></ul>
  6. 6. Monitoring and Analyzing Traffic <ul><li>TCPdump and Windump </li></ul><ul><ul><li>Provide insight into the traffic activity on a network </li></ul></ul><ul><ul><ul><li>ftp:// </li></ul></ul></ul><ul><ul><ul><li>http:// </li></ul></ul></ul><ul><li>Ethereal </li></ul><ul><ul><li>GUI to interpret all layers of the packet </li></ul></ul>
  7. 7. Goals of Intrusion Detection System (IDS) <ul><li>Detect wide variety of intrusions </li></ul><ul><ul><li>Previously known and unknown attacks </li></ul></ul><ul><ul><li>Suggests need to learn/adapt to new attacks or changes in behavior </li></ul></ul><ul><li>Detect intrusions in timely fashion </li></ul><ul><ul><li>May need to be real-time, especially when system responds to intrusion </li></ul></ul><ul><ul><ul><li>Problem: analyzing commands may impact response time of system </li></ul></ul></ul><ul><ul><li>May suffice to report intrusion occurred a few minutes or hours ago </li></ul></ul>
  8. 8. Goals of Intrusion Detect. System (IDS) (2) <ul><li>Present analysis in simple, easy-to-understand format </li></ul><ul><li>Be accurate </li></ul><ul><ul><li>Minimize false positives, false negatives </li></ul></ul><ul><ul><ul><li>False positive : An event, incorrectly identified by the IDS as being an intrusion when none has occurred </li></ul></ul></ul><ul><ul><ul><li>False negative : An event that the IDS fails to identify as an intrusion when one has in fact occurred </li></ul></ul></ul><ul><ul><li>Minimize time spent verifying attacks, looking for them </li></ul></ul>
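The two error types defined on this slide can be turned into rates once each event has a ground-truth label. The following is a minimal sketch; the function name `error_rates` and the sample data are illustrative assumptions, not part of the original slides.

```python
def error_rates(labels, alerts):
    """Return (false_positive_rate, false_negative_rate).

    labels[i] is True when event i really was an intrusion.
    alerts[i] is True when the IDS raised an alarm for event i.
    """
    pairs = list(zip(labels, alerts))
    false_pos = sum(1 for truth, alarm in pairs if alarm and not truth)
    false_neg = sum(1 for truth, alarm in pairs if truth and not alarm)
    negatives = sum(1 for truth, _ in pairs if not truth)
    positives = sum(1 for truth, _ in pairs if truth)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Invented sample: four benign events and two intrusions.
labels = [False, False, False, False, True, True]
alerts = [False, True, False, False, True, False]
print(error_rates(labels, alerts))  # (0.25, 0.5)
```

One benign event out of four triggers an alarm (false positive rate 0.25) and one intrusion out of two is missed (false negative rate 0.5).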
  9. 9. IDS Architecture <ul><li>Sensors (agent) </li></ul><ul><ul><li>to collect data and forward info to the analyzer </li></ul></ul><ul><ul><ul><li>network packets </li></ul></ul></ul><ul><ul><ul><li>log files </li></ul></ul></ul><ul><ul><ul><li>system call traces </li></ul></ul></ul><ul><li>Analyzers (detector) </li></ul><ul><ul><li>To receive input from one or more sensors or from other analyzers </li></ul></ul><ul><ul><li>To determine if an intrusion has occurred </li></ul></ul><ul><li>User interface </li></ul><ul><ul><li>To enable a user to view output from the system or control the behavior of the system </li></ul></ul>
  10. 10. IDS Architecture
  11. 11. Signature-Based Intrusion Detection <ul><li>Human analysts investigate suspicious traffic </li></ul><ul><li>Extract signatures </li></ul><ul><ul><li>Features of known intrusions </li></ul></ul><ul><li>Use pre-defined signatures to discover malicious packets </li></ul><ul><li>Examples </li></ul><ul><ul><li>LaBrea Tarpit by Tom Liston </li></ul></ul><ul><ul><li>Snort and Snort rules by Marty Roesch </li></ul></ul>
  12. 12. Snort by Marty Roesch <ul><li>A free, open-source network intrusion detection system </li></ul><ul><ul><li>Signature-based; uses a combination of rules and preprocessors </li></ul></ul><ul><ul><li>Runs on many platforms, including UNIX and Windows </li></ul></ul><ul><li>Preprocessors </li></ul><ul><ul><li>IP defragmentation, port-scan detection, web traffic normalization, TCP stream reassembly, … </li></ul></ul><ul><ul><li>Can analyze streams, not only a single packet at a time </li></ul></ul>
  13. 13. Problems in Signature-Based Intrusion Detection Systems <ul><li>Many false positives: prone to generating alerts when there is no problem in fact </li></ul><ul><ul><li>Signatures are not specific enough </li></ul></ul><ul><ul><li>A packet is not examined in context with those that precede it or those that follow </li></ul></ul><ul><li>Cannot detect unknown intrusions </li></ul><ul><ul><li>Rely on signatures extracted by human experts </li></ul></ul>
  14. 14. Misuse vs. Anomaly Detection <ul><li>Misuse detection : use patterns of well-known attacks to identify intrusions </li></ul><ul><ul><li>Classification based on known intrusions </li></ul></ul><ul><ul><li>E.g., three consecutive login failures: password guessing. </li></ul></ul><ul><li>Anomaly detection : use deviation from normal usage patterns to identify intrusions </li></ul><ul><ul><li>Any significant deviations from the expected behavior are reported as possible attacks </li></ul></ul>
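The misuse-detection example on this slide (three consecutive login failures suggest password guessing) can be written as a hand-coded rule. This is a minimal sketch; the event format `(user, success)` and the function name are assumptions made for illustration.

```python
def password_guessing_alerts(events, threshold=3):
    """Flag users who reach `threshold` consecutive failed logins.

    events: iterable of (user, success) pairs in time order.
    Returns the flagged users in the order they were flagged.
    """
    streak = {}   # user -> current run of consecutive failures
    flagged = []
    for user, success in events:
        if success:
            streak[user] = 0
        else:
            streak[user] = streak.get(user, 0) + 1
            if streak[user] == threshold:
                flagged.append(user)
    return flagged


# Invented login log: alice fails three times in a row, bob does not.
events = [
    ("alice", False), ("bob", False), ("alice", False),
    ("bob", True), ("alice", False), ("bob", False),
]
print(password_guessing_alerts(events))  # ['alice']
```

The rule only fires on the pattern it was written for, which is the limitation of misuse detection that the later slides discuss.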
  15. 15. Misuse vs. Anomaly Detection <ul><li>Misuse detection </li></ul><ul><ul><li>Definition: matching the sequence of “signature actions” of known intrusion scenarios </li></ul></ul><ul><ul><li>Shortcomings: known patterns have to be hand-coded; unable to detect any future intrusion </li></ul></ul><ul><ul><li>Example: STAT [HLMS90] </li></ul></ul><ul><li>Anomaly detection </li></ul><ul><ul><li>Definition: using statistical measures on system features </li></ul></ul><ul><ul><li>Shortcomings: relies on experience in selecting the system features; has to study the sequential interrelation between transactions </li></ul></ul><ul><ul><li>Example: IDES [LTG+92] </li></ul></ul>
  16. 16. Host-based vs. Network-based <ul><li>According to data sources </li></ul><ul><li>Host-based detection : the data is collected from an individual host </li></ul><ul><ul><li>Directly monitor the host data files and OS processes </li></ul></ul><ul><ul><li>Can determine exactly which host resources are the targets of a particular attack </li></ul></ul><ul><li>Network-based detection : the data is traffic across the network </li></ul><ul><ul><li>A set of traffic sensors within the network </li></ul></ul><ul><ul><li>Can easily be hardened against attacks and hidden from the attackers </li></ul></ul>
  17. 17. OUTLINE <ul><li>Intrusion detection and computer security </li></ul><ul><li>Current intrusion detection approaches </li></ul><ul><li>Data Mining Approaches for Intrusion Detection </li></ul><ul><li>Summary </li></ul>
  18. 18. Current Intrusion Detection Approaches: Misuse Detection <ul><li>Misuse detection : </li></ul><ul><ul><li>Record the specific patterns of intrusions </li></ul></ul><ul><ul><li>Monitor current audit trails (event sequences) and match them against the recorded patterns </li></ul></ul><ul><ul><li>Report the matched events as intrusions </li></ul></ul><ul><ul><li>Representation models: expert rules, colored Petri nets, state-transition diagrams, etc. </li></ul></ul>
  19. 19. Misuse Detection Example <ul><li>Expert systems: use a set of rules to describe attacks </li></ul><ul><ul><li>IDES, ComputerWatch, NIDX, P-BEST, ISOA </li></ul></ul><ul><li>Signature analysis: capture features of attacks in audit trail </li></ul><ul><ul><li>Haystack, NetRanger, RealSecure, MuSig </li></ul></ul><ul><li>State-transition analysis: use state-transition diagrams </li></ul><ul><ul><li>STAT, USTAT, and NetSTAT </li></ul></ul><ul><li>Other approaches </li></ul><ul><ul><li>Colored Petri nets, e.g., IDIOT </li></ul></ul><ul><ul><li>Case-based reasoning, e.g., AUTOGUARD </li></ul></ul>
  20. 20. Current Intrusion Detection Approaches: Anomaly Detection <ul><li>Anomaly detection: </li></ul><ul><ul><li>Establishing the normal behavior profiles </li></ul></ul><ul><ul><li>Observing and comparing current activities with the (normal) profiles </li></ul></ul><ul><ul><li>Reporting significant deviations as intrusions </li></ul></ul><ul><ul><li>Statistical measures as behavior profiles: ordinal and categorical (binary and linear) </li></ul></ul>
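The three anomaly-detection steps on this slide (establish a profile, compare current activity with it, report significant deviations) can be sketched with a single numeric measure. The choice of mean and standard deviation as the profile, the threshold of three standard deviations, and all names below are assumptions for illustration, not the method of any system named in these slides.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Step 1: summarize normal behavior as (mean, sample standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, profile, max_z=3.0):
    """Steps 2 and 3: compare a new observation with the profile and
    report it when it deviates by more than max_z standard deviations."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_z


# Invented measure: logins per hour observed during attack-free operation.
profile = build_profile([10, 12, 11, 9, 10, 11, 12, 10])
print(is_anomalous(11, profile))  # False
print(is_anomalous(40, profile))  # True
```

A real profile would track many measures at once, which is where the feature-selection problem raised on the following slides comes from.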
  21. 21. Anomaly Detection Example <ul><li>Statistical methods: multivariate, temporal analysis </li></ul><ul><ul><li>IDES, NIDES, EMERALD </li></ul></ul><ul><li>Expert systems </li></ul><ul><ul><li>ComputerWatch, Wisdom & Sense </li></ul></ul>
  22. 22. Problems of Current Intrusion Detection Approaches <ul><li>Main problems: manual and ad-hoc </li></ul><ul><ul><li>Misuse detection: </li></ul></ul><ul><ul><ul><li>Known intrusion patterns have to be hand-coded </li></ul></ul></ul><ul><ul><ul><li>Unable to detect any new intrusions (that have no matched patterns recorded in the system) </li></ul></ul></ul><ul><ul><li>Anomaly detection: </li></ul></ul><ul><ul><ul><li>Selecting the right set of system features to be measured is ad hoc and based on experience </li></ul></ul></ul><ul><ul><ul><li>Unable to capture sequential interrelation between events </li></ul></ul></ul>
  23. 23. OUTLINE <ul><li>Intrusion detection and computer security </li></ul><ul><li>Current intrusion detection approaches </li></ul><ul><li>Data Mining Approaches for Intrusion Detection </li></ul><ul><li>Summary </li></ul>
  24. 24. Why Can Data Mining Help? <ul><li>Data mining: applying specific algorithms to extract patterns from data </li></ul><ul><li>Normal and intrusive activities leave evidence in audit data </li></ul><ul><li>From the data-centric point of view, intrusion detection is a data analysis process </li></ul>
  25. 25. Why Can Data Mining Help? <ul><li>Successful applications in related domains, e.g., fraud detection, fault/alarm management </li></ul><ul><li>Learn from traffic data </li></ul><ul><ul><li>Supervised learning: learn precise models from past intrusions </li></ul></ul><ul><ul><li>Unsupervised learning: identify suspicious activities </li></ul></ul><ul><li>Maintain or update models on dynamic data </li></ul>
  26. 28. Frequent Patterns <ul><li>Patterns that occur frequently in a database </li></ul><ul><li>Mining frequent patterns – finding regularities </li></ul><ul><li>Process of mining frequent patterns for intrusion detection </li></ul><ul><ul><li>Phase I: mine a repository of normal frequent itemsets from attack-free data </li></ul></ul><ul><ul><li>Phase II: find frequent itemsets in the last n connections and compare the patterns to the normal profile </li></ul></ul>
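The two phases on this slide can be sketched as follows. This is a brute-force illustration that counts every itemset up to a small size rather than a real frequent-pattern miner such as Apriori; the record encoding (sets of `feature=value` strings), the function names, and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=2):
    """Return itemsets (as frozensets) that occur in at least
    min_support (a fraction) of the records."""
    counts = Counter()
    for record in records:
        items = sorted(set(record))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = len(records)
    return {s for s, c in counts.items() if c / total >= min_support}


def novel_patterns(normal_records, recent_records, min_support=0.5):
    """Phase I: mine the normal profile from attack-free data.
    Phase II: mine the recent window and keep patterns absent from the profile."""
    normal = frequent_itemsets(normal_records, min_support)
    recent = frequent_itemsets(recent_records, min_support)
    return recent - normal


# Invented connection records.
normal = [
    {"service=http", "flag=SF"}, {"service=http", "flag=SF"},
    {"service=smtp", "flag=SF"}, {"service=http", "flag=SF"},
]
recent = [
    {"service=http", "flag=S0"}, {"service=http", "flag=S0"},
    {"service=http", "flag=SF"}, {"service=http", "flag=S0"},
]
for pattern in sorted(novel_patterns(normal, recent), key=len):
    print(sorted(pattern))
```

In this toy data the recent window is dominated by `flag=S0`, which never appears in the normal profile, so both `{flag=S0}` and `{flag=S0, service=http}` are reported.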
  27. 29. Frequent Pattern Mining in MINDS <ul><li>MINDS: an IDS using data mining techniques </li></ul><ul><ul><li>University of Minnesota </li></ul></ul><ul><li>Summarizing attacks using association rules </li></ul><ul><ul><li>{Src IP=, Dest Port=139, Bytes ∈ [150, 200)} → {ATTACK} </li></ul></ul>
  28. 30. Patterns About Alerts <ul><li>Ning et al., CCS’02 </li></ul><ul><li>Find correlated alerts – the frequent patterns of alerts </li></ul><ul><ul><li>Attack scenarios – the logical connections between alerts </li></ul></ul><ul><ul><li>A hyper-alert correlation graph approach </li></ul></ul><ul><li>Use the correlation of intrusion alerts to identify high-level attacks </li></ul>
  29. 31. Association rules <ul><li>Used for link analysis </li></ul><ul><li>E.g.: </li></ul><ul><ul><li>If the number of failed login attempts ( num_failed_login_attempts ) and the network service on the destination ( service ) are features, an example of a rule is: </li></ul></ul><ul><ul><li>num_failed_login_attempts = 6, service = FTP => attack = DoS [1, 0.28 ] </li></ul></ul>
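Association rules are usually scored by support (how often the whole rule holds in the data) and confidence (how often the consequent holds when the antecedent does). The sketch below computes both for a rule shaped like the one on this slide. The feature names come from the slide; the records, the function name, and the resulting numbers are invented, and reading the slide's bracketed pair as measures of this kind is an assumption.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    records, antecedent and consequent are dicts of feature -> value.
    """
    def matches(record, condition):
        return all(record.get(k) == v for k, v in condition.items())

    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Invented audit records.
records = [
    {"num_failed_login_attempts": 6, "service": "FTP", "attack": "DoS"},
    {"num_failed_login_attempts": 6, "service": "FTP", "attack": "DoS"},
    {"num_failed_login_attempts": 0, "service": "HTTP", "attack": "none"},
    {"num_failed_login_attempts": 0, "service": "FTP", "attack": "none"},
]
rule_if = {"num_failed_login_attempts": 6, "service": "FTP"}
rule_then = {"attack": "DoS"}
print(rule_stats(records, rule_if, rule_then))  # (0.5, 1.0)
```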
  30. 32. Sequential Pattern Analysis <ul><li>Models sequence patterns </li></ul><ul><li>(Temporal) order is important in many situations </li></ul><ul><ul><li>Time-series databases and sequence databases </li></ul></ul><ul><ul><li>Frequent patterns → (frequent) sequential patterns </li></ul></ul><ul><li>Sequential patterns for intrusion detection </li></ul><ul><ul><li>Capture the signatures for attacks in a series of packets </li></ul></ul>
  31. 33. Classification: A Two-Step Process <ul><li>Model construction: describe a set of predetermined classes </li></ul><ul><ul><li>Training dataset: tuples for model construction </li></ul></ul><ul><ul><ul><li>Each tuple/sample belongs to a predefined class </li></ul></ul></ul><ul><ul><li>Classification rules, decision trees, or math formulae </li></ul></ul><ul><li>Model application: classify unseen objects </li></ul><ul><ul><li>Estimate accuracy of the model using an independent test set </li></ul></ul><ul><ul><li>Acceptable accuracy → apply the model to classify data tuples with unknown class labels </li></ul></ul>
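The two steps (model construction on labeled training tuples, then accuracy estimation on an independent test set before classifying unlabeled data) can be sketched with a deliberately simple model. A nearest-centroid classifier is used here only because it fits in a few lines; it is a stand-in, not one of the methods named in these slides, and the data is invented.

```python
import math


def train(samples):
    """Step 1, model construction: one centroid per class.

    samples: list of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, features):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))


def accuracy(model, samples):
    """Step 2, model application: estimate accuracy on an independent test set."""
    hits = sum(1 for features, label in samples if classify(model, features) == label)
    return hits / len(samples)


training = [([1, 1], "normal"), ([2, 1], "normal"),
            ([9, 8], "attack"), ([8, 9], "attack")]
testing = [([1.5, 2], "normal"), ([9, 9], "attack")]

model = train(training)
if accuracy(model, testing) >= 0.9:    # the acceptance threshold is an assumption
    print(classify(model, [7, 7]))     # classify a tuple with unknown label
```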
  32. 34. Classification Methods <ul><li>Basic Algorithm ID3 </li></ul><ul><li>Neural networks </li></ul><ul><li>Bayesian classification </li></ul><ul><ul><li>Naïve Bayesian classification </li></ul></ul><ul><ul><li>Bayesian belief network </li></ul></ul><ul><li>Support vector machines </li></ul>
  33. 35. Classification for Intrusion Detection <ul><li>Misuse detection </li></ul><ul><ul><li>Classification based on known intrusions </li></ul></ul><ul><li>Example: Sinclair et al. “An application of machine learning to network intrusion detection” </li></ul><ul><ul><li>Use decision trees and ID3 on host session data </li></ul></ul><ul><ul><li>Use genetic algorithms to generate rules </li></ul></ul><ul><ul><ul><li>If <pattern> then <alert> </li></ul></ul></ul>
  34. 36. HIDE <ul><li>“A hierarchical network intrusion detection system using statistical processing and neural network classification” by Zheng et al. </li></ul><ul><li>Five major components </li></ul><ul><ul><li>Probes collect traffic data </li></ul></ul><ul><ul><li>Event preprocessor preprocesses traffic data and feeds the statistical model </li></ul></ul><ul><ul><li>Statistical processor maintains a model for normal activities and generates vectors for new events </li></ul></ul><ul><ul><li>Neural network classifies the vectors of new events </li></ul></ul><ul><ul><li>Post processor generates reports </li></ul></ul>
  35. 37. Intrusion Detection by NN and SVM <ul><li>S. Mukkamala et al., IEEE IJCNN May 2002 </li></ul><ul><li>Discover useful patterns or features that describe user behavior on a system </li></ul><ul><li>Use the set of relevant features to build classifiers </li></ul><ul><li>SVMs have great potential to be used in place of NNs due to their scalability and faster training and running time </li></ul><ul><li>NNs are especially suited for multi-category classification </li></ul>
  36. 38. Clustering <ul><li>Group data into clusters </li></ul><ul><li>What is a good clustering? </li></ul><ul><ul><li>High intra-class similarity and low inter-class similarity </li></ul></ul><ul><ul><ul><li>Depending on the similarity measure </li></ul></ul></ul><ul><ul><li>The ability to discover some or all of the hidden patterns </li></ul></ul><ul><li>Clustering approaches </li></ul><ul><ul><li>K-means </li></ul></ul><ul><ul><li>Hierarchical clustering </li></ul></ul><ul><ul><li>Density-based methods </li></ul></ul><ul><ul><li>Grid-based methods </li></ul></ul><ul><ul><li>Model-based methods </li></ul></ul>
  37. 39. Clustering for Intrusion Detection <ul><li>Anomaly detection </li></ul><ul><ul><li>Any significant deviations from the expected behavior are reported as possible attacks </li></ul></ul><ul><li>Build clusters as models for normal activities </li></ul><ul><li>“A scalable clustering for intrusion signature recognition” by Ye and Li </li></ul><ul><ul><li>Use description of clusters as signatures of intrusions </li></ul></ul>
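Building clusters as models of normal activity and reporting deviations can be sketched with k-means, the first clustering approach listed on the previous slide. The seeding (first k points), the fixed iteration count, the distance threshold, and the sample points are all assumptions for illustration. This is not the algorithm from the paper cited on this slide.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; seeds are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def is_deviation(point, centroids, max_distance):
    """Report a point that is far from every cluster of normal activity."""
    return min(math.dist(point, c) for c in centroids) > max_distance


# Invented two-feature records of normal activity.
normal_points = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
centroids, assignment = kmeans(normal_points, k=2)
print(assignment)                                        # [0, 1, 0, 1, 0, 1]
print(is_deviation((5, 5), centroids, max_distance=2))   # True
print(is_deviation((1, 1), centroids, max_distance=2))   # False
```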
  38. 40. Alert Correlation <ul><li>F. Cuppens and A. Miege, in IEEE S&P’02 </li></ul><ul><li>Use clustering and merging functions to recognize alerts that correspond to the same occurrence of an attack </li></ul><ul><ul><li>Create a new alert that merges the data contained in these various alerts </li></ul></ul><ul><li>Generate global and synthetic alerts to reduce the number of alerts further </li></ul>
  39. 41. Mining Data Streams <ul><li>Continuously arriving data in multiple, rapid, time-varying, possibly unpredictable and unbounded streams </li></ul><ul><li>Many applications </li></ul><ul><ul><li>Financial applications, network monitoring, security, telecommunications data management, web applications, manufacturing, sensor networks, etc. </li></ul></ul>
  40. 42. Mining Data Streams for Intrusion Detection <ul><li>Maintaining profiles of normal activities </li></ul><ul><ul><li>The profiles of normal activities may drift </li></ul></ul><ul><li>Identifying novel attacks </li></ul><ul><ul><li>Identifying clusters and outliers in traffic data streams </li></ul></ul>
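A profile that is allowed to drift can be sketched with an exponentially weighted mean and variance, updated one observation at a time so that nothing beyond the running summary is stored. The class name, the smoothing factor, the warm-up length, and the threshold are assumptions; the slides do not prescribe a particular method.

```python
class DriftingProfile:
    """Running profile of one numeric measure over a data stream."""

    def __init__(self, alpha=0.1, max_z=3.0, warmup=5):
        self.alpha = alpha      # weight given to the newest observation
        self.max_z = max_z      # deviation threshold, in standard deviations
        self.warmup = warmup    # observations to see before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.seen = 0

    def observe(self, x):
        """Return True if x is an outlier; otherwise fold x into the profile."""
        self.seen += 1
        if self.seen == 1:
            self.mean = float(x)
            return False
        diff = x - self.mean
        std = self.var ** 0.5
        outlier = (
            self.seen > self.warmup and std > 0 and abs(diff) > self.max_z * std
        )
        if not outlier:
            # Exponentially weighted update, so the profile follows slow drift.
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        return outlier


# Invented stream of per-minute connection counts with one burst.
profile = DriftingProfile()
flags = [profile.observe(x) for x in [10, 11, 9, 10, 12, 10, 11, 50, 10]]
print(flags)
```

Outliers are not folded into the profile here, which keeps a burst from inflating the variance; the trade-off is that a genuine, abrupt change in normal behavior would keep being flagged until the profile is rebuilt.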
  41. 43. A Systematic Framework (J. Stolfo et al.) <ul><li>Build good models: </li></ul><ul><ul><li>select appropriate features of audit data to build intrusion detection models </li></ul></ul><ul><li>Build better models: </li></ul><ul><ul><li>architect a hierarchical detector system that combines multiple detection models </li></ul></ul><ul><li>Build updated models: </li></ul><ul><ul><li>dynamically update and deploy new detection systems as needed </li></ul></ul>
  42. 44. A Systematic Framework <ul><li>Support for the feature selection and model construction: </li></ul><ul><ul><li>Apply data mining algorithms to find consistent inter- and intra- audit record (event) patterns </li></ul></ul><ul><ul><li>Use the features and time windows in the discovered patterns to build detection models </li></ul></ul><ul><ul><li>A support environment to semi-automate this process </li></ul></ul>
  43. 45. A Systematic Framework <ul><li>Combining multiple detection models: </li></ul><ul><ul><li>Each (base) detector model monitors one aspect of the system </li></ul></ul><ul><ul><li>They can employ different techniques and be independent of each other </li></ul></ul><ul><ul><li>The learned (meta) detector combines evidence from a number of base detectors </li></ul></ul><ul><li>An intelligent agent-based architecture : </li></ul><ul><ul><li>learning agents: continuously compute (learn) the detection models </li></ul></ul><ul><ul><li>detection agents: use the (updated) models to detect intrusions </li></ul></ul>
  44. 46. A Systematic Framework
  45. 47. Building Classifiers for Intrusion Detection (J. Stolfo et al.) <ul><li>Experiments in constructing classification models for anomaly detection </li></ul><ul><li>Two experiments: </li></ul><ul><ul><li>sendmail system call data </li></ul></ul><ul><ul><li>network tcpdump data </li></ul></ul><ul><li>Use a meta classifier to combine multiple classification models </li></ul>
  46. 48. Classification Models on sendmail <ul><li>The data: sequence of system calls made by sendmail . </li></ul><ul><li>Classification models (rules): describe the “normal” patterns of the system call sequences. </li></ul><ul><li>The rule set is the normal profile of sendmail </li></ul><ul><li>Detection: calculate the deviation from the profile </li></ul><ul><ul><li>a large number (or a high score) of “violations” of the rules in a new trace suggests an exploit </li></ul></ul>
  47. 49. Classification Models on sendmail <ul><li>The sendmail data: </li></ul><ul><ul><li>Each trace has two columns: the process ids and the system call numbers </li></ul></ul><ul><ul><li>Normal traces: sendmail and sendmail daemon </li></ul></ul><ul><ul><li>Abnormal traces: sunsendmailcp, syslog-remote, syslog-local, decode, sm5x and sm565a attacks </li></ul></ul>
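The idea of scoring a new trace by how often it violates the normal profile can be sketched without a rule learner. The sketch below swaps in a simpler technique: the profile is the set of short system-call windows seen in normal traces, and a violation is a window that is not in that set. This substitution, the window length, the function names, and the system-call numbers are all assumptions; the slides describe learned classification rules instead.

```python
def windows(trace, n):
    """All length-n sliding windows over a sequence of system-call numbers."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_normal_profile(normal_traces, n=3):
    """Collect every window that occurs in the attack-free traces."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile


def violation_rate(trace, profile, n=3):
    """Fraction of windows in a new trace that the normal profile has never seen."""
    seen = windows(trace, n)
    if not seen:
        return 0.0
    misses = sum(1 for w in seen if w not in profile)
    return misses / len(seen)


# Invented system-call numbers.
profile = build_normal_profile([[1, 2, 3, 4, 1, 2, 3, 4, 5]])
print(violation_rate([2, 3, 4, 1, 2, 3], profile))  # 0.0
print(violation_rate([1, 2, 9, 9, 3, 4], profile))  # 1.0
```

The second lesson on the later "Lessons learned" slide shows up directly here: any normal window missing from the training traces would be counted as a violation.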
  48. 50. Classification Models on sendmail <ul><li>Lessons learned: </li></ul><ul><ul><li>Normal behavior can be established and used to detect anomalous usage </li></ul></ul><ul><ul><li>Need to collect near “complete” normal data in order to build the “normal” model </li></ul></ul><ul><ul><li>But how do we know when to stop collecting? </li></ul></ul><ul><ul><li>Need tools to guide the audit data gathering process </li></ul></ul>
  49. 51. Classification Models on tcpdump <ul><li>The tcpdump data (part of a public data visualization contest): </li></ul><ul><ul><li>Packets of incoming, out-going, and internal broadcast traffic </li></ul></ul><ul><ul><li>One trace of normal network traffic </li></ul></ul><ul><ul><li>Three traces of network intrusions </li></ul></ul>
  50. 52. Data Preprocessing <ul><li>Extract the “connection” level features: </li></ul><ul><ul><li>Record connection attempts </li></ul></ul><ul><ul><li>Watch how each connection is terminated </li></ul></ul><ul><li>Each record has: </li></ul><ul><ul><li>start time and duration </li></ul></ul><ul><ul><li>participating hosts and ports (applications) </li></ul></ul><ul><ul><li>statistics (e.g., # of bytes) </li></ul></ul><ul><ul><li>flag: normal or a connection/termination error </li></ul></ul><ul><ul><li>protocol: TCP or UDP </li></ul></ul><ul><li>Divide connections into 3 types: incoming, out-going, and inter-LAN </li></ul>
  51. 53. Building a Classifier for Each Type of Connection <ul><li>Use the destination service (port) as the class label </li></ul><ul><li>Training data: 80% of the normal connections </li></ul><ul><li>Testing data: 20% of the normal connections and connections in the 3 intrusion traces </li></ul><ul><li>Apply RIPPER to learn rules </li></ul>
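The connection record described on this slide, and the split into three connection types, can be sketched as a small data structure. The field names, the LAN address range, and the extra `"external"` category for traffic that touches no local host are assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class Connection:
    start_time: float   # seconds since the start of the trace
    duration: float
    src_host: str
    src_port: int
    dst_host: str
    dst_port: int       # destination port, i.e. the service
    num_bytes: int
    flag: str           # "normal" or a connection/termination error
    protocol: str       # "TCP" or "UDP"


def connection_type(conn, lan="192.168.0.0/16"):
    """Classify a connection relative to the local network."""
    net = ip_network(lan)
    src_local = ip_address(conn.src_host) in net
    dst_local = ip_address(conn.dst_host) in net
    if src_local and dst_local:
        return "inter-LAN"
    if dst_local:
        return "incoming"
    if src_local:
        return "out-going"
    return "external"


conn = Connection(0.0, 1.2, "203.0.113.7", 40000, "192.168.1.9", 25,
                  512, "normal", "TCP")
print(connection_type(conn))  # incoming
```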
  52. 54. Lessons Learned <ul><li>Data preprocessing requires extensive domain knowledge </li></ul><ul><li>Adding temporal features improves classification accuracy </li></ul><ul><li>Need tools to guide (temporal) feature selection </li></ul>
  53. 55. Meta Classifier that Combines Evidence from Multiple Detection Models <ul><li>Build base classifiers that each model one aspect of the system </li></ul><ul><li>The meta learning task: </li></ul><ul><ul><li>each record has a collection of evidence from base classifiers, and a class label “normal” or “abnormal” on the state of the system </li></ul></ul><ul><li>Apply a learning algorithm to produce the meta classifier </li></ul>
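The meta-learning task can be sketched with the simplest possible learner: a lookup table that maps each combination of base-classifier outputs to the label most often seen with it in training. This table learner is a stand-in for whatever learning algorithm is actually applied; the function names, the default label for unseen evidence, and the records are assumptions.

```python
from collections import Counter, defaultdict


def train_meta(records):
    """records: list of (evidence, label) pairs, where evidence is the tuple
    of base-classifier outputs and label is "normal" or "abnormal".
    Returns a table: evidence -> most frequent label."""
    votes = defaultdict(Counter)
    for evidence, label in records:
        votes[tuple(evidence)][label] += 1
    return {ev: counter.most_common(1)[0][0] for ev, counter in votes.items()}


def meta_predict(table, evidence, default="normal"):
    """Look up the learned label; fall back to `default` for unseen evidence."""
    return table.get(tuple(evidence), default)


# Invented training records: (output of base 1, output of base 2), true state.
records = [
    (("normal", "normal"), "normal"),
    (("abnormal", "normal"), "normal"),
    (("abnormal", "normal"), "abnormal"),
    (("abnormal", "normal"), "abnormal"),
    (("abnormal", "abnormal"), "abnormal"),
]
table = train_meta(records)
print(meta_predict(table, ("abnormal", "normal")))  # abnormal
```

The table learns that an alarm from the first base classifier alone is usually a real problem in this toy data, which a fixed "both must agree" rule would miss.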