Data Mining for Security Applications



Kulesh Shanmugasundaram
November 10 2001

∗ Feel free to edit this document but please let me know what you did so that I can keep my copy fresh. Thanks!

1 Introduction

This is a summary of discussions at the Workshop on Data Mining for Security Applications, CCS'01, PA. In this document data mining takes a broad meaning which may sometimes include machine learning (ML) and artificial intelligence (AI). Furthermore, forensics and intrusion detection are interchangeable in some contexts. Please note that it is beyond the scope of our discussions to provide better definitions of these terms.

It was noted in early discussions that most data mining solutions fail to recognize simple, effective substitutes for sophisticated, computationally intensive data mining techniques. One suggestion was that future performance comparisons should include some of these simple, effective methods where appropriate. It was also noted that most data mining techniques do not place enough emphasis on pre-processing and post-processing of datasets. It was emphasized that we should use machine learning techniques to fine-tune input datasets before data mining and should automate decision making after data mining. Authors demonstrated the use of data mining for intrusion detection, for identifying denial of service attacks and for forensics.

2 Questions...

Traditionally, data mining has solved problems in database systems, bio-informatics (where data mining techniques are still being used successfully to map the genome) and financial engineering. Recently the data mining community started applying similar techniques to existing security problems. "This event provides an opportunity for attendees of the ACM CCS to meet with researchers who are interested in applying data mining techniques to security applications and discuss critical issues of mutual interest during a concentrated period."

Two fundamental questions were asked and went mostly unanswered in our discussions.

1. Are we trying to solve the right security problem(s)?
   Are denial of service and intrusion detection the right problems for data mining, or are there other security problems, such as cryptanalysis, where data mining could be more effective, useful and perhaps preventive? It was suggested that forensics is one of the fields that can use data mining for effective data reduction and for learning new insights or patterns. There were no other suggestions.

2. Do security problems need development of new data mining techniques?
   It is hard to answer this question without answering the previous one. We have not yet identified security problems that mandate a whole new data mining approach. However, the panel felt strongly that new data mining techniques will have to be investigated in the near future. One suggestion was to investigate techniques used in bio-informatics to solve security problems, especially intrusion detection.

3 Ideas & Opinions

Following is a collection of ideas and opinions that came out of this workshop. Most of them revolve around security problems for which data mining can provide solutions.
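As a rough illustration of the suggestion in the Introduction that performance comparisons should include simple, effective methods, the Python sketch below builds a plain threshold detector and scores it by detection rate and false positive rate. Any data mining detector could be passed to the same scoring function for a like-for-like comparison. The function names (threshold_detector, evaluate) and the traffic figures are hypothetical and were not presented at the workshop.

```python
from statistics import mean, stdev


def threshold_detector(train, k=3.0):
    """Simple baseline: flag values more than k standard deviations
    above the mean of attack-free training data."""
    limit = mean(train) + k * stdev(train)
    return lambda value: value > limit


def evaluate(detector, samples):
    """Return (detection_rate, false_positive_rate).

    samples is a list of (value, is_attack) pairs and must contain
    at least one attack and one normal sample.
    """
    true_pos = false_pos = attacks = normals = 0
    for value, is_attack in samples:
        flagged = detector(value)
        if is_attack:
            attacks += 1
            true_pos += flagged
        else:
            normals += 1
            false_pos += flagged
    return true_pos / attacks, false_pos / normals


# Hypothetical measurements, e.g. connections per second (assumption).
normal_traffic = [10, 12, 11, 9, 13, 10, 11, 12, 10, 11]
labeled = [(11, False), (12, False), (14, False),
           (95, True), (120, True), (13, True)]

baseline = threshold_detector(normal_traffic)
detection_rate, false_positive_rate = evaluate(baseline, labeled)
print(f"baseline: detection={detection_rate:.2f}, "
      f"false positives={false_positive_rate:.2f}")
```

In this toy data the low-volume attack (13) slips under the threshold, which is the kind of gap a more sophisticated technique would need to close to justify its extra cost.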
3.1 Fusion of Information

As networks and network sensors become ubiquitous, fusion of sensor information is critical to developing accurate insights on incidents. Therefore, fusion of sensor information and the development of infrastructure to support fusion are important areas for research and development. For instance, stream mining techniques can be used to develop tools that give a better overview of network traffic in [near] real time[2]. Network forensics is another area well positioned to benefit from fusion of information. One of many problems with information fusion is the lack of industry support for adopting a common standard for intrusion message exchange. However, IETF and TERENA are working together on a couple of standards for intrusion message exchange.

3.2 Rule Generation & Data Reduction

Automated rule generation for intrusion detection systems [to identify new threats], rule generation for data mining systems to filter datasets efficiently, and data reduction in data mining systems without loss of critical information are still in primitive stages of development. Research and development effort must be put into developing better automated rule generation methods. False alarm filtering is considered an open problem. It was mentioned that most commercial IDS products produce as much as 80% false positives. New methods are required to filter false alarms without opening the way to stealth attacks. That is, an attacker may trigger a high volume of false alarms, and if the IDS reacts by filtering out that false alarm the attacker can then bypass the IDS without triggering the alarm. New methods should reduce false alarms but should avoid such attacks as well.

3.3 Automated Ruleset Propagation

One of the problems still not addressed by IDS vendors is how to propagate rulesets or attack signatures securely over networks. An automated update strategy, through an overlay network approach, should allow intrusion detection systems to be more adaptive. Currently RealSecure is the only system that supports an anti-virus-like method of updating rulesets from a central server. However, updating rulesets in a heterogeneous network means more than connecting to a central server and downloading new rulesets.

3.4 Gene Coding Applications

The idea of gene coding applications is similar to the gene coding of bio-organisms. Tools and methods should be developed to extract application behaviors at different levels of the software engineering process and embed these behaviors along with the application code. Upon execution of a gene coded application, application level firewalls and intrusion detection systems use the embedded gene code of the application to detect anomalous behaviors.

It seems quite obvious that the network is not the best place to perform effective filtering. There is too much noise on the network; any effective filtering means first filtering the noise out and then focusing on interesting signals. Host based detection methods, in comparison, are much more effective in that there is less noise. We can, however, raise the bar further by deploying detection methods at the applications themselves. New methods should be developed which allow application developers to characterize "normal" behaviors of applications and package that information as part of the application code. Intrusion detection systems and firewalls can then rely on this information to model anomalous behaviors.

3.5 Feature Selection of Attacks

Feature selection of attacks is an important element of intrusion detection systems. Currently there are not many useful feature selection and categorization methods available[14]. Such selection criteria would allow real time attack profiling and adaptive attack containment by intrusion detection systems.

3.6 Lack of Data Visualization

There is a lack of data visualization tools for network applications and forensics. Development of data visualization tools that offer multiple perspectives on a single set of data is an immediate necessity. Some form of certification should be developed to certify, for forensic tools, that such abstractions are not altering evidence and are actually "telling the truth." An open source forensic data visualization library seems to be a good starting point for such a certification process.

3.7 Lack of Datasets

It is a great concern of the community that the lack of realistic test datasets is making the research uncertain. What works on a test dataset may not work properly on real datasets; on the other hand, methods that are not efficient on test datasets may turn out to be effective on real data. Therefore, there is an immediate need for a tool or a network infrastructure to collect real datasets for the community while maintaining privacy standards.

References

[1] Critical Thoughts on Contemporary Data Mining Research for Security Applications, Klaus Julisch
[2] Fusing Heterogeneous Alert Streams into Scenarios, Oliver M. Dain, Robert K. Cunningham
[3] Using MIB II Variables for Network Anomaly Detection - A Feasibility Study, Xinzhou Qin et al.
[4] Intrusion Detection with Unlabeled Data Using Clustering, Leonid Portnoy et al.
[5] Multi-Topic Email Authorship Attribution Forensics
[6] Panel discussions in and out of Sonata -03
[7] An Intrusion Detection System Based on the Teiresias Pattern Discovery Algorithm, Andreas Wespi et al.
[8] The GeneMine system for genome/proteome annotation and collaborative data mining
[9] Mining High Speed Data Streams, Pedro Domingos, Geoff Hulten
[10] Dr. Sushil Jajodia, ~csis/faculty/jajodia.html
[11] Johannes Gehrke
[12] Wenke Lee, ~wenke/
[13] Philip Chan, ~pkc/
[14] Columbia IDS Group