Data Mining: Introduction

Presentation Transcript

  • Intrusion Detection
  • Outline
    • Intrusion detection and computer security
    • Current intrusion detection approaches
    • Data Mining Approaches for Intrusion Detection
    • Summary
  • Intrusion Detection and Computer Security
    • Computer security goals:
      • Confidentiality, integrity, and availability
    • Intrusion is a set of actions aimed to compromise these security goals
    • Intrusion prevention (authentication, encryption, etc.) alone is not sufficient
    • Intrusion detection is needed
  • Intrusion Examples
    • Intrusions : Any set of actions that threaten the integrity, availability, or confidentiality of a network resource
    • Examples
      • Denial of service (DoS): attempts to starve a host of resources needed to function correctly
      • Scan: reconnaissance on the network or a particular host
      • Worms and viruses: replicating on other hosts
      • Compromises: obtain privileged access to a host by known vulnerabilities
  • Intrusion Detection
    • Intrusion detection: The process of monitoring and analyzing the events occurring in a computer and/or network system in order to detect signs of security problems
    • Primary assumption : User and program activities can be monitored and modeled
    • Steps
      • Monitoring and analyzing traffic
      • Identifying abnormal activities
      • Assessing severity and raising alarm
  • Monitoring and Analyzing Traffic
    • TCPdump and Windump
      • Provide insight into the traffic activity on a network
        • ftp://
        • http://
    • Ethereal
      • GUI to interpret all layers of the packet
  • Goals of Intrusion Detection System (IDS)
    • Detect wide variety of intrusions
      • Previously known and unknown attacks
      • Suggests need to learn/adapt to new attacks or changes in behavior
    • Detect intrusions in timely fashion
      • May need to be real-time, especially when system responds to intrusion
        • Problem: analyzing commands may impact response time of system
      • May suffice to report intrusion occurred a few minutes or hours ago
  • Goals of Intrusion Detection System (IDS) (2)
    • Present analysis in simple, easy-to-understand format
    • Be accurate
      • Minimize false positives, false negatives
        • False positive : An event, incorrectly identified by the IDS as being an intrusion when none has occurred
        • False negative : An event that the IDS fails to identify as an intrusion when one has in fact occurred
      • Minimize time spent verifying attacks, looking for them
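The two error types defined above can be turned into rates once every event has a ground-truth label. The following is a minimal sketch, assuming a hypothetical `error_rates` helper and made-up labels; neither comes from the slides.

```python
def error_rates(actual, flagged):
    """Return (false_positive_rate, false_negative_rate).

    actual[i]  is True when event i really was an intrusion.
    flagged[i] is True when the IDS raised an alert for event i.
    """
    pairs = list(zip(actual, flagged))
    false_pos = sum(1 for a, f in pairs if f and not a)
    false_neg = sum(1 for a, f in pairs if a and not f)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Illustrative data only: 4 benign events and 2 real intrusions.
actual = [False, False, False, False, True, True]
flagged = [False, True, False, False, True, False]
fpr, fnr = error_rates(actual, flagged)
print(fpr, fnr)
```

Lowering one rate usually raises the other, so both are worth tracking whenever a detection threshold is tuned.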
  • IDS Architecture
    • Sensors (agent)
      • to collect data and forward info to the analyzer
        • network packets
        • log files
        • system call traces
    • Analyzers (detector)
      • To receive input from one or more sensors or from other analyzers
      • To determine if an intrusion has occurred
    • User interface
      • To enable a user to view output from the system or control the behavior of the system
  • IDS Architecture
  • Signature-Based Intrusion Detection
    • Human analysts investigate suspicious traffic
    • Extract signatures
      • Features of known intrusions
    • Use pre-defined signatures to discover malicious packets
    • Examples
      • LaBrea Tarpit by Tom Liston
      • Snort and Snort rules by Marty Roesch
  • Snort by Marty Roesch
    • An open source free network intrusion detection system
      • Signature-based; uses a combination of rules and preprocessors
      • Runs on many platforms, including UNIX and Windows
    • Preprocessors
      • IP defragmentation, port-scan detection, web traffic normalization, TCP stream reassembly, …
      • Can analyze streams, not only a single packet at a time
  • Problems in Signature-Based Intrusion Detection Systems
    • Many false positives: prone to generating alerts when there is in fact no problem
      • Signatures are not specific enough
      • A packet is not examined in context with those that precede it or those that follow
    • Cannot detect unknown intrusions
      • Rely on signatures extracted by human experts
  • Misuse vs. Anomaly Detection
    • Misuse detection : use patterns of well-known attacks to identify intrusions
      • Classification based on known intrusions
      • E.g., three consecutive login failures: password guessing.
    • Anomaly detection : use deviation from normal usage patterns to identify intrusions
      • Any significant deviations from the expected behavior are reported as possible attacks
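The misuse example above (three consecutive login failures) can be written as a single hand-coded rule. This is a sketch under stated assumptions: the event format `(user, succeeded)` and the function name `password_guessing_alerts` are hypothetical, chosen only for illustration.

```python
def password_guessing_alerts(events, threshold=3):
    """Return users who reach `threshold` consecutive failed logins.

    events: iterable of (user, succeeded) tuples in time order.
    A successful login resets that user's failure streak.
    """
    streak = {}
    alerts = []
    for user, succeeded in events:
        if succeeded:
            streak[user] = 0
        else:
            streak[user] = streak.get(user, 0) + 1
            if streak[user] == threshold:  # alert once per streak
                alerts.append(user)
    return alerts


events = [
    ("alice", False), ("bob", False), ("alice", False),
    ("bob", True), ("alice", False), ("bob", False),
]
alerts = password_guessing_alerts(events)
print(alerts)
```

The rule only recognizes the one pattern it encodes, which is the limitation of misuse detection that the later slides return to.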
  • Misuse vs. Anomaly Detection
    • Misuse detection
      • Definition: matching the sequence of “signature actions” of known intrusion scenarios
      • Example: STAT [HLMS90]
      • Shortcomings: known patterns have to be hand-coded; unable to detect any future (unknown) intrusion
    • Anomaly detection
      • Definition: using statistical measures on system features
      • Example: IDES [LTG+92]
      • Shortcomings: relies on experience in selecting the system features; has to study the sequential interrelation between transactions
  • Host-based vs. Network-based
    • According to data sources
    • Host-based detection : the data is collected from an individual host
      • Directly monitor the host data files and OS processes
      • Can determine exactly which host resources are the targets of a particular attack
    • Network-based detection : the data is traffic across the network
      • A set of traffic sensors within the network
      • Can easily be hardened against attacks and hidden from attackers
  • Outline
    • Intrusion detection and computer security
    • Current intrusion detection approaches
    • Data Mining Approaches for Intrusion Detection
    • Summary
  • Current Intrusion Detection Approaches—Misuse Detection
    • Misuse detection :
      • Record the specific patterns of intrusions
      • Monitor current audit trails (event sequences) and match them against the recorded patterns
      • Report the matched events as intrusions
      • Representation models: expert rules, colored Petri nets, state-transition diagrams, etc.
  • Misuse Detection Example
    • Expert systems: use a set of rules to describe attacks
      • IDES, ComputerWatch, NIDX, P-BEST, ISOA
    • Signature analysis: capture features of attacks in audit trail
      • Haystack, NetRanger, RealSecure, MuSig
    • State-transition analysis: use state-transition diagrams
      • STAT, USTAT, and NetSTAT
    • Other approaches
      • Colored Petri nets, e.g., IDIOT
      • Case-based reasoning, e.g., AUTOGUARD
  • Current Intrusion Detection Approaches—Anomaly Detection
    • Anomaly detection:
      • Establishing the normal behavior profiles
      • Observing and comparing current activities with the (normal) profiles
      • Reporting significant deviations as intrusions
      • Statistical measures as behavior profiles: ordinal and categorical (binary and linear)
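The profile-then-compare steps above can be illustrated with a single numeric feature. This is a minimal sketch, assuming a hypothetical feature (logins per hour) and a simple mean and standard deviation profile; the systems named in these slides use richer statistical measures.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Summarize attack-free observations as (mean, sample std dev)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, profile, max_sigma=3.0):
    """Report a deviation of more than max_sigma standard deviations."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical attack-free history for one user.
normal_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
profile = build_profile(normal_logins_per_hour)
print(is_anomalous(6, profile), is_anomalous(40, profile))
```

The threshold `max_sigma` is an assumption; it sets the trade-off between false positives and false negatives.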
  • Anomaly Detection Example
    • Statistical methods: multivariate, temporal analysis
    • Expert systems
      • ComputerWatch, Wisdom & Sense
  • Problems of Current Intrusion Detection Approaches
    • Main problems: manual and ad-hoc
      • Misuse detection:
        • Known intrusion patterns have to be hand-coded
        • Unable to detect any new intrusions (that have no matched patterns recorded in the system)
      • Anomaly detection:
        • Selecting the right set of system features to be measured is ad hoc and based on experience
        • Unable to capture sequential interrelation between events
    • Intrusion detection and computer security
    • Current intrusion detection approaches
    • Data Mining Approaches for Intrusion Detection
    • Summary
  • Why Can Data Mining Help?
    • Data mining: applying specific algorithms to extract patterns from data
    • Normal and intrusive activities leave evidence in audit data
    • From a data-centric point of view, intrusion detection is a data analysis process
  • Why Can Data Mining Help?
    • Successful applications in related domains, e.g., fraud detection, fault/alarm management
    • Learn from traffic data
      • Supervised learning: learn precise models from past intrusions
      • Unsupervised learning: identify suspicious activities
    • Maintain or update models on dynamic data
  • Frequent Patterns
    • Patterns that occur frequently in a database
    • Mining Frequent patterns – finding regularities
    • Process of Mining Frequent patterns for intrusion detection
      • Phase I: mine a repository of normal frequent itemsets from attack-free data
      • Phase II: find frequent itemsets in the last n connections and compare the patterns to the normal profile
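The two phases above can be sketched as follows. This is an illustration under stated assumptions: connections are modeled as sets of hypothetical `feature=value` strings, and itemsets are counted by brute-force enumeration up to size 2 rather than by an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


# Phase I: profile mined from attack-free connections (made-up data).
normal = [
    {"service=http", "flag=SF"},
    {"service=http", "flag=SF"},
    {"service=smtp", "flag=SF"},
    {"service=http", "flag=SF"},
]
normal_profile = frequent_itemsets(normal, min_support=0.5)

# Phase II: itemsets in the last n connections, compared with the profile.
window = [
    {"service=http", "flag=SF"},
    {"service=telnet", "flag=REJ"},
    {"service=telnet", "flag=REJ"},
    {"service=telnet", "flag=REJ"},
]
recent = frequent_itemsets(window, min_support=0.5)
novel = set(recent) - set(normal_profile)
print(novel)
```

Itemsets that are frequent in the recent window but absent from the normal profile are the candidates for further inspection.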
  • Frequent Pattern Mining in MINDS
    • MINDS: an IDS using data mining techniques
      • University of Minnesota
    • Summarizing attacks using association rules
      • {Src IP=, Dest Port=139, Bytes ∈ [150, 200)} → {ATTACK}
  • Patterns About Alerts
    • Ning et al. CCS’02
    • Find correlated alerts – the frequent patterns of alerts
      • Attack scenarios – the logical connections between alerts
      • A hyper-alerts correlation graph approach
    • Use the correlation of intrusion alerts to identify high level attacks
  • Association Rules
    • Used for link analysis
    • E.g.:
      • If the number of failed login attempts ( num_failed_login_attempts ) and the network service on the destination ( service ) are features, an example rule is:
      • num_failed_login_attempts = 6, service = FTP => attack = DoS [1, 0.28]
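Association rules are conventionally scored by support and confidence. The slide does not say what its bracketed pair denotes, so the sketch below only shows how those two standard measures would be computed for a rule of this shape; the records and the `rule_stats` helper are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    records: list of dicts mapping feature names to values.
    support    = fraction of records matching both sides.
    confidence = fraction of antecedent matches that also match the consequent.
    """
    def matches(record, conditions):
        return all(record.get(k) == v for k, v in conditions.items())

    n_ante = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up audit records for illustration.
records = [
    {"num_failed_login_attempts": 6, "service": "FTP", "attack": "DoS"},
    {"num_failed_login_attempts": 6, "service": "FTP", "attack": "DoS"},
    {"num_failed_login_attempts": 0, "service": "FTP", "attack": "none"},
    {"num_failed_login_attempts": 0, "service": "HTTP", "attack": "none"},
]
support, confidence = rule_stats(
    records,
    {"num_failed_login_attempts": 6, "service": "FTP"},
    {"attack": "DoS"},
)
print(support, confidence)
```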
  • Sequential Pattern Analysis
    • Models sequence patterns
    • (Temporal) order is important in many situations
      • Time-series databases and sequence databases
      • Frequent patterns → (frequent) sequential patterns
    • Sequential patterns for intrusion detection
      • Capture the signatures for attacks in a series of packets
  • Classification: A Two-Step Process
    • Model construction: describe a set of predetermined classes
      • Training dataset: tuples for model construction
        • Each tuple/sample belongs to a predefined class
      • Classification rules, decision trees, or math formulae
    • Model application: classify unseen objects
      • Estimate accuracy of the model using an independent test set
      • Acceptable accuracy → apply the model to classify data tuples with unknown class labels
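The two steps above (build a model from labeled tuples, then estimate accuracy on an independent test set) can be sketched with a deliberately simple classifier. A nearest-centroid model is used here as a stand-in for the methods listed in these slides, and the two features (failed logins, bytes transferred) are hypothetical.

```python
def train_centroids(samples):
    """Model construction: samples is a list of (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [v / counts[label] for v in acc] for label, acc in sums.items()
    }


def classify(model, features):
    """Model application: pick the label with the closest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, samples):
    """Fraction of held-out samples whose predicted label is correct."""
    hits = sum(1 for x, y in samples if classify(model, x) == y)
    return hits / len(samples)


train = [
    ([1, 200], "normal"), ([0, 180], "normal"),
    ([9, 20], "attack"), ([8, 30], "attack"),
]
test = [([1, 190], "normal"), ([10, 25], "attack")]
model = train_centroids(train)
test_accuracy = accuracy(model, test)
print(test_accuracy)
```

Keeping the test set separate from the training tuples is what makes the accuracy estimate meaningful for unseen data.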
  • Classification Methods
    • Basic decision-tree algorithm: ID3
    • Neural networks
    • Bayesian classification
      • Naïve Bayesian classification
      • Bayesian belief network
    • Support vector machines
  • Classification for Intrusion Detection
    • Misuse detection
      • Classification based on known intrusions
    • Example: Sinclair et al. “An application of machine learning to network intrusion detection”
      • Use decision trees and ID3 on host session data
      • Use genetic algorithms to generate rules
        • If <pattern> then <alert>
  • HIDE
    • “A hierarchical network intrusion detection system using statistical processing and neural network classification” by Zheng et al.
    • Five major components
      • Probes collect traffic data
      • Event preprocessor preprocesses traffic data and feeds the statistical model
      • Statistical processor maintains a model for normal activities and generates vectors for new events
      • Neural network classifies the vectors of new events
      • Post processor generates reports
  • Intrusion Detection by NN and SVM
    • S. Mukkamala et al., IEEE IJCNN May 2002
    • Discover useful patterns or features that describe user behavior on a system
    • Use the set of relevant features to build classifiers
    • SVMs have great potential to be used in place of NNs due to their scalability and faster training and running times
    • NNs are especially suited for multi-category classification
  • Clustering
    • Group data into clusters
    • What is a good clustering?
      • High intra-class similarity and low inter-class similarity
        • Depending on the similarity measure
      • The ability to discover some or all of the hidden patterns
    • Clustering Approaches
      • K-means
      • Hierarchical Clustering
      • Density-based methods
      • Grid-based methods
      • Model-based
  • Clustering for Intrusion Detection
    • Anomaly detection
      • Any significant deviations from the expected behavior are reported as possible attacks
    • Build clusters as models for normal activities
    • “A scalable clustering for intrusion signature recognition” by Ye and Li
      • Use description of clusters as signatures of intrusions
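Building clusters as a model of normal activity, then reporting points far from every cluster, might look like the sketch below. It assumes plain k-means with deterministic seeding and two made-up numeric features; it is not the clustering method of Ye and Li.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def is_outlier(point, centroids, radius):
    """A point farther than `radius` from every centroid is suspicious."""
    return min(math.dist(point, c) for c in centroids) > radius


# Hypothetical attack-free observations forming two groups.
normal = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centroids = kmeans(normal, k=2)
print(is_outlier((1.1, 1.0), centroids, radius=1.0))
print(is_outlier((5.0, 5.0), centroids, radius=1.0))
```

The `radius` value is an assumption; in practice it would be derived from the spread of each cluster.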
  • Alert Correlation
    • F. Cuppens and A. Miege, in IEEE S&P’02
    • Use clustering and merging functions to recognize alerts that correspond to the same occurrence of an attack
      • Create a new alert that merges the data contained in these various alerts
    • Generate global and synthetic alerts to reduce the number of alerts further
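The idea of merging alerts that belong to the same attack occurrence can be sketched with a much simpler criterion than the one in the cited paper. The version below assumes that alerts sharing source, destination, and kind within a fixed time window belong together; the field names and the 60-second window are hypothetical.

```python
def merge_alerts(alerts, window=60):
    """Merge alerts with the same (src, dst, kind) that arrive close together.

    Each merged alert records the first and last time and a count.
    """
    merged = []
    latest = {}  # (src, dst, kind) -> most recent merged alert for that key
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["src"], alert["dst"], alert["kind"])
        group = latest.get(key)
        if group is not None and alert["time"] - group["last"] <= window:
            group["last"] = alert["time"]
            group["count"] += 1
        else:
            group = {
                "src": alert["src"], "dst": alert["dst"], "kind": alert["kind"],
                "first": alert["time"], "last": alert["time"], "count": 1,
            }
            latest[key] = group
            merged.append(group)
    return merged


# Made-up alerts using documentation-only IP addresses.
alerts = [
    {"time": 0, "src": "198.51.100.7", "dst": "192.0.2.10", "kind": "scan"},
    {"time": 5, "src": "198.51.100.7", "dst": "192.0.2.10", "kind": "dos"},
    {"time": 10, "src": "198.51.100.7", "dst": "192.0.2.10", "kind": "scan"},
    {"time": 200, "src": "198.51.100.7", "dst": "192.0.2.10", "kind": "scan"},
]
merged = merge_alerts(alerts)
print(len(merged))
```

Here four raw alerts collapse into three, because the two scans ten seconds apart are treated as one occurrence.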
  • Mining Data Streams
    • Data arrive continuously in multiple, rapid, time-varying, possibly unpredictable, and unbounded streams
    • Many applications
      • Financial applications, network monitoring, security, telecommunications data management, web application, manufacturing, sensor networks, etc.
  • Mining Data Streams for Intrusion Detection
    • Maintaining profiles of normal activities
      • The profiles of normal activities may drift
    • Identifying novel attacks
      • Identifying clusters and outliers in traffic data streams
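A profile that drifts with the stream while still flagging outliers can be sketched with an exponentially weighted mean and variance. This is one possible approach under stated assumptions, not a method named in the slides; the class name `DriftingProfile` and its parameters are hypothetical.

```python
import math


class DriftingProfile:
    """Exponentially weighted profile of one numeric stream feature."""

    def __init__(self, alpha=0.2, max_sigma=3.0, warmup=5):
        self.alpha = alpha          # weight given to each new observation
        self.max_sigma = max_sigma  # deviation threshold in std devs
        self.warmup = warmup        # observations seen before flagging starts
        self.seen = 0
        self.mean = 0.0
        self.var = 0.0

    def observe(self, x):
        """Return True if x is an outlier; otherwise fold x into the profile."""
        self.seen += 1
        if self.seen == 1:
            self.mean = float(x)
            return False
        diff = x - self.mean
        outlier = (
            self.seen > self.warmup
            and abs(diff) > self.max_sigma * math.sqrt(self.var)
        )
        if not outlier:  # outliers do not contaminate the normal profile
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return outlier


profile = DriftingProfile()
stream = [100, 102, 98, 101, 99, 100, 500]  # made-up packets-per-second values
flags = [profile.observe(x) for x in stream]
print(flags)
```

Because old observations decay geometrically, the profile follows gradual changes in normal traffic without storing the unbounded stream.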
  • A Systematic Framework—J.Stolfo et al.
    • Build good models:
      • select appropriate features of audit data to build intrusion detection models
    • Build better models:
      • architect a hierarchical detector system that combines multiple detection models
    • Build updated models:
      • dynamically update and deploy new detection system as needed
  • A Systematic Framework
    • Support for the feature selection and model construction:
      • Apply data mining algorithms to find consistent inter- and intra- audit record (event) patterns
      • Use the features and time windows in the discovered patterns to build detection models
      • A support environment to semi-automate this process
  • A Systematic Framework
    • Combining multiple detection models:
      • Each (base) detector model monitors one aspect of the system
      • They can employ different techniques and be independent of each other
      • The learned (meta) detector combines evidence from a number of base detectors
    • An intelligent agent-based architecture :
      • learning agents: continuously compute (learn) the detection models
      • detection agents: use the (updated) models to detect intrusions
  • A Systematic Framework
  • Building Classifiers for Intrusion Detection— J.Stolfo et al.
    • Experiments in constructing classification models for anomaly detection
    • Two experiments:
      • sendmail system call data
      • network tcpdump data
    • Use a meta classifier to combine multiple classification models
  • Classification Models on sendmail
    • The data: sequence of system calls made by sendmail .
    • Classification models (rules): describe the “normal” patterns of the system call sequences.
    • The rule set is the normal profile of sendmail
    • Detection: calculate the deviation from the profile
      • A large number (or high score) of “violations” of the rules in a new trace suggests an exploit
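The deviation-from-profile idea above can be illustrated without a rule learner. The sketch below swaps the learned rule set for a plain lookup table of short system call sequences seen in normal traces, then scores a new trace by the fraction of its windows not found in that table. The system call numbers are made up.

```python
def ngrams(trace, n):
    """All length-n sliding windows over a list of system call numbers."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_normal_profile(traces, n=3):
    """Collect every window observed in the attack-free traces."""
    profile = set()
    for trace in traces:
        profile.update(ngrams(trace, n))
    return profile


def violation_rate(trace, profile, n=3):
    """Fraction of windows in `trace` that never occur in the profile."""
    windows = ngrams(trace, n)
    if not windows:
        return 0.0
    misses = sum(1 for w in windows if w not in profile)
    return misses / len(windows)


normal_traces = [[4, 2, 66, 66, 4, 138], [4, 2, 66, 4, 138, 66]]
profile = build_normal_profile(normal_traces)
normal_score = violation_rate([4, 2, 66, 66, 4, 138], profile)
suspect_score = violation_rate([4, 2, 66, 5, 5, 138], profile)
print(normal_score, suspect_score)
```

A high violation rate in a new trace plays the role of the large number of rule violations described above.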
  • Classification Models on sendmail
    • The sendmail data:
      • Each trace has two columns: the process ids and the system call numbers
      • Normal traces: sendmail and sendmail daemon
      • Abnormal traces: sunsendmailcap, syslog-remote, syslog-local, decode, sm5x and sm56a attacks
  • Classification Models on sendmail
    • Lessons learned:
      • Normal behavior can be established and used to detect anomalous usage
      • Need to collect near “complete” normal data in order to build the “normal” model
      • But how do we know when to stop collecting?
      • Need tools to guide the audit data gathering process
  • Classification Models on tcpdump
    • The tcpdump data (part of a public data visualization contest):
      • Packets of incoming, out-going, and internal broadcast traffic
      • One trace of normal network traffic
      • Three traces of network intrusions
  • Data Preprocessing
    • Extract the “connection” level features:
      • Record connection attempts
      • Watch how connection is terminated
    • Each record has:
      • start time and duration
      • participating hosts and ports (applications)
      • statistics (e.g., # of bytes)
      • flag: normal or a connection/termination error
      • protocol: TCP or UDP
    • Divide connections into 3 types: incoming, out-going, and inter-LAN
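The connection record and the three-way split described above could be represented as follows. This is a sketch, assuming a hypothetical local network prefix and field names; the extra "external" result covers traffic that touches no local host, a case the slide does not mention.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical local network (a documentation-only address range).
LOCAL_NET = ip_network("192.0.2.0/24")


@dataclass
class Connection:
    start_time: float
    duration: float
    src_host: str
    dst_host: str
    src_port: int
    dst_port: int   # identifies the destination service (application)
    num_bytes: int
    flag: str       # "normal" or a connection/termination error code
    protocol: str   # "TCP" or "UDP"


def connection_type(conn, local_net=LOCAL_NET):
    """Classify a connection by which endpoints are inside the local network."""
    src_local = ip_address(conn.src_host) in local_net
    dst_local = ip_address(conn.dst_host) in local_net
    if src_local and dst_local:
        return "inter-LAN"
    if dst_local:
        return "incoming"
    if src_local:
        return "out-going"
    return "external"  # assumption: not one of the slide's three types


conn = Connection(0.0, 1.5, "198.51.100.7", "192.0.2.10", 40000, 80, 1200, "normal", "TCP")
print(connection_type(conn))
```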
  • Building a Classifier for Each Type of Connection
    • Use the destination service (port) as the class label
    • Training data: 80% of the normal connections
    • Testing data: 20% of the normal connections and connections in the 3 intrusion traces
    • Apply RIPPER to learn rules
  • Lessons Learned
    • Data preprocessing requires extensive domain knowledge
    • Adding temporal features improves classification accuracy
    • Need tools to guide (temporal) feature selection
  • Meta Classifier that Combines Evidence from Multiple Detection Models
    • Build base classifiers that each model one aspect of the system
    • The meta learning task:
      • each record has a collection of evidence from base classifiers, and a class label “normal” or “abnormal” on the state of the system
    • Apply a learning algorithm to produce the meta classifier
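The meta-learning record described above (base-classifier evidence plus a class label) can be fed to any learner. As a minimal sketch, the learner below is just a lookup table that remembers the most common label for each combination of base outputs; this is an illustrative stand-in, and the data and the `"abnormal"` default for unseen combinations are assumptions.

```python
from collections import Counter, defaultdict


def train_meta(records):
    """records: list of (evidence, label); evidence is a tuple of base outputs.

    Returns a table mapping each evidence tuple to its most common label.
    """
    votes = defaultdict(Counter)
    for evidence, label in records:
        votes[tuple(evidence)][label] += 1
    return {e: c.most_common(1)[0][0] for e, c in votes.items()}


def meta_predict(table, evidence, default="abnormal"):
    """Combine base evidence; unseen combinations fall back to `default`."""
    return table.get(tuple(evidence), default)


# Made-up evidence from two base classifiers watching different aspects.
records = [
    (("normal", "normal"), "normal"),
    (("normal", "normal"), "normal"),
    (("abnormal", "normal"), "normal"),
    (("abnormal", "normal"), "abnormal"),
    (("abnormal", "normal"), "abnormal"),
    (("abnormal", "abnormal"), "abnormal"),
]
table = train_meta(records)
print(meta_predict(table, ("abnormal", "normal")))
print(meta_predict(table, ("normal", "abnormal")))
```

The meta level learns how much to trust each combination of base outputs instead of relying on a fixed vote.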