SlideShare a Scribd company logo
1 of 2
Download to read offline
Are Evolutionary Rule Learning Algorithms Appropriate for
Malware Detection?
M. Zubair Shafiq, S. Momina Tabish, Muddassar Farooq
Next Generation Intelligent Networks Research Center (nexGIN RC)
National University of Computer & Emerging Sciences (FAST-NUCES)
Islamabad, 44000, Pakistan
{zubair.shafiq,momina.tabish,muddassar.farooq}@nexginrc.org
ABSTRACT
In this paper, we evaluate the performance of ten well-known
evolutionary and non-evolutionary rule learning algorithms.
The comparative study is performed on a real-world clas-
sification problem of detecting malicious executables. The
executable dataset, used in this study, consists of a total of
189 attributes which are statically extracted from the ex-
ecutables of Microsoft Windows operating system. In our
study, we evaluate the performance of rule learning algo-
rithms with respect to four metrics: (1) classification ac-
curacy, (2) the number of rules in the developed rule set,
(3) the comprehensibility of the generated rules, and (4) the
processing overhead of the rule learning process. The results
of our study highlight important shortcomings in evolution-
ary rule learning classifiers that render them infeasible for
deployment in a real-world malware detection system.
Categories and Subject Descriptors
D.4.6 [Software]: Security and Protection—Invasive soft-
ware
General Terms
Algorithms, Experimentation, Security
Keywords
Genetics Based Machine Learning, Malware Detection
1. INTRODUCTION
The genetic based machine learning systems, more com-
monly known as GBML, have become significantly more so-
phisticated compared with primitive versions of the learning
classifier systems (LCS) proposed by John Holland in 1970s.
A number of LCS variants exist that include evolutionary
rule learners, evolutionary neural networks and other hy-
brid schemes. But evolutionary rule learners have been the
most popular amongst all. They can be further subdivided
into Michigan style, Pittsburgh style, anticipatory, and it-
erative rule learners [1]. Most of the recent work in LCSs
is focused on classical Michigan- and Pittsburgh-style evo-
lutionary rule learners that include but are not limited to
XCS, UCS, GAssist and GALE.
Copyright is held by the author/owner(s).
GECCO’09, July 8–12, 2009, Montréal Québec, Canada.
ACM 978-1-60558-325-9/09/07.
In this paper, we evaluate several well-known evolutionary
rule learning classification algorithms (XCS, UCS, GAssist-
ADI, GAssist-Intervalar, SLAVE) using a real-world clas-
sification problem of malware detection. We also compare
these evolutionary rule learning algorithms with five other
well-knwon non-evolutionary rule learning algorithms: RIP-
PER, SLIPPER, PART, C4.5 rules, RIONA. The malware
detection problem has several stringent constraints other
than classification accuracy, such as comprehensibility of
the developed solution (for malware forensic experts) and
processing overheads (for realtime deployment). Therefore,
the malware detection problem is relatively more challeng-
ing because accuracy and efficiency of classification must be
optimized simultaneously.
The executable dataset, used in this study, consists of
11, 786 executable files for the Microsoft Windows operating
system in Portable Executable (PE) format [3]. A total of
189 attributes are statically extracted from the executables.
The executable dataset consists of two types of executables:
benign and malicious. We have collected 1, 447 benign PE
files from the local area network of our virology lab. We have
obtained 10, 339 malicious executables from a publicly avail-
able malware database called ‘VX Heavens Virus Collection’
[4].
In order to undertake a comprehensive study, we use four
performance metrics for comparison: (1) classification accu-
racy, (2) the number of rules, (3) comprehensibility of the
rules, and (4) processing overhead. To factor out implemen-
tation related bias in our study, we use the implementations
of rule learning algorithms provided in a unified framework
called Knowledge Extraction based on Evolutionary Learn-
ing (KEEL) [2].
2. RESULTS & CONCLUSION
In this section, we provide the results of our experiments
and relevant discussions. We have used the default para-
meters of all algorithms as provided in KEEL, unless stated
otherwise. We compare the performance of classifiers with
the help of four performance metrics:
2.1 Classification Accuracy
In a typical two-class problems, such as malicious exe-
cutable detection, the classification decision of an algorithm
may fall into one of the following four categories: (1) true
positive (TP), (2) true negative (TN), (3) false positive (FP),
and (4) false negative (FN). A suitable metric for quantify-
ing classification accuracy of an algorithm, also used in this
study, is accuracy which is defined as:
Table 1: Classification accuracy of all algorithms
Algorithm Average Number Execution
Accuracy of rules time (sec)
Evolutionary Rule Based Algorithms
XCS 0.9550± 0.0057 10, 000.0 109.0
UCS 0.9847± 0.0034 9, 652.6 16.3
GAssist-ADI 0.9929±0.0013 2.6 1, 177.2
GAssist-Intervalar 0.9873± 0.0024 4.2 4, 060.9
SLAVE 0.9917±0.0013 3.3 775.1
Non-Evolutionary Rule Based Algorithms
RIPPER 0.9961± 0.0033 6.0 8.0
SLIPPER 0.9975± 0.0023 57.0 29.4
PART 0.9956± 0.0028 6.3 51.8
C4.5 rules 0.9962± 0.0031 7.4 37.0
RIONA 0.9899± 0.0052 - 2680.5
accuracy =
TP + TN
TP + TN + FP + FN
Table 1 shows the classification accuracy of all algorithms
(both evolutionary and non-evolutionary). We represent
the best evolutionary and non-evolutionary classifier with
a bold value. The best evolutionary rule learning algorithm
is GAssit-ADI, closely followed by SLAVE which also has
the average accuracy of more than 0.99. GAssist-Intervalar
and UCS have accuracies in the range of approximately 0.98.
XCS provides the worst accuracy amongst all classification
algorithms used in our study. SLIPPER outperforms rest of
the non-evolutionary rule learning algorithms. It is closely
followed by C4.5 rules, RIPPER and PART respectively.
The relatively worst classifier among non-evolutionary algo-
rithm is RIONA that provides an accuracy of approximately
0.99. It is evident from Table 1 that non-evolutionary rule
learning algorithms clearly outperform their evolutionary
counterparts in terms of accuracy.
2.2 Number of Rules
The number of rules developed by an algorithm are de-
pendent on the various input parameters. Sometimes, the
maximum possible number of rules is explicitly limited via
an input parameter. Table 1 shows that the number of rules
in XCS and UCS are limited by the MaxPopSize (=10, 000)
parameter as they try to overfit. In comparison, other evo-
lutionary rule learning algorithms utilize significantly fewer
rules for classifying a malware. Non-evolutionary rule learn-
ing algorithms also generate significantly less number of rules.
However, SLIPPER appears to be the only exception, which
generates more than fifty rules on the average.
2.3 Comprehensibility of Rules
The comprehensibility of generated rules is another im-
portant metric as it provides great help to the ‘malware
forensic experts’ in understanding the behavior and charac-
teristics of a given malware. The comprehensibility of rules
is a subjective term so we quantify it by using coarse grain
Low, Medium, and High categories.
Both XCS and UCS develop conjunctive IF-THEN rules
containing specified intervals for every attribute. Moreover,
the number of rules in the final rule set are very large (see
Table 1). Due to these obvious shortcomings, we say that the
comprehensibility of the rules generated by XCS and UCS
is Low. GAssist-ADI mostly develops rules by using single-
sided ranges and has default rules as well. The comprehensi-
bility of rules generated by GAssist-ADI can be categorized
as High. GAssist-Intervalar, similar to XCS and UCS, gener-
ates rules that contain specified intervals for every attribute.
However, the final rule set just contains only a few rules. So
the comprehensibility of the generated rules is categorized as
Medium. SLAVE uses fixed intervals (such as L1,L2,L3,L4)
to generate the fuzzy rules. It also assigns weights to the
individual rules. Overall, the comprehensibility of the rules
generated by SLAVE can be ranked as Medium. The rules
generated by all non-evolutionary rule learning algorithms
are not only simpler to comprehend but are also smaller in
number. Therefore, the comprehensibility of the rules gen-
erated by RIPPER, SLIPPER, PART and C4.5 rules can be
categorized as High.
2.4 Processing Overheads
The processing overheads are reported in Table 1 as the
average time (in seconds) required for training and test-
ing per fold of the 10-fold cross validation process. It is
evident from Table 1 that the processing overheads of all
evolutionary rule learning algorithms, except UCS, are sig-
nificantly large compared with non-evolutionary classifiers.
Malware detection systems are deployed in real operating
systems; therefore, the processing overhead is a crucial met-
ric. GAssist-ADI and GAssist-Intervalar have very large
processing overheads. The processing overheads of XCS and
SLAVE are also high. UCS is the only evolutionary rule
learning algorithm which has a processing overhead compa-
rable with non-evolutionary algorithms.
All non-evolutionary rule learning algorithms, except RI-
ONA, have acceptable processing overheads. The processing
overhead of RIONA is large because it uses a combination of
instance based learning and rule induction. For every given
test instance, it first selects the neighboring instances simi-
lar to the nearest neighbor algorithm. After selection, it de-
velops rule sets through rule induction. The complexity in-
curred due to exhaustive search of the neighboring instances
for every test instance is a major performance bottleneck.
2.5 Conclusions
In this paper, we have compared ten rule learning algo-
rithms for a real-world problem of executable malware de-
tection. The results of our study show that non-evolutionary
rule learning algorithms clearly outperform evolutionary rule
learning algorithms for every performance metric.
3. REFERENCES
[1] J.H. Holland, L.B. Booker, M. Colombetti, M. Dorigo,
D.E. Goldberg, S. Forrest, R.L. Riolo, R.E. Smith, P.L.
Lanzi, W. Stolzmann, S.W. Wilson, “What Is a Learning
Classifier System?”, Internatinoal Workshop on Learning
Classifier Systems (IWLCS), Volume 1813 of Lecture Notes
in Artificial Intelligence, pp. 3-32, Springer, 2000.
[2] J. Alcala-Fdez, L. Sanchez, S. Garcia, M.J. del Jesus, S.
Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit,
V.M. Rivas, J.C. Fernandez, F. Herrera, “KEEL: a software
tool to assess evolutionary algorithms for data mining
problems”, Soft Computing, Volume 13, pp. 307-318,
Springer, 2009.
[3] Microsoft Portable Executable and Common Object File
Format Specification, Windows Hardware Developer
Central, Updated March 2008.
[4] VX Heavens Virus Collection, VX Heavens website,
available at http://vx.netlux.org.

More Related Content

Viewers also liked

Are metamorphic viruses really invincible
Are metamorphic viruses really invincibleAre metamorphic viruses really invincible
Are metamorphic viruses really invincibleUltraUploader
 
Are handheld viruses a significant threat
Are handheld viruses a significant threatAre handheld viruses a significant threat
Are handheld viruses a significant threatUltraUploader
 
Attack of the killer virus!
Attack of the killer virus!Attack of the killer virus!
Attack of the killer virus!UltraUploader
 
Approaches to integrated malware detection and avoidance
Approaches to integrated malware detection and avoidanceApproaches to integrated malware detection and avoidance
Approaches to integrated malware detection and avoidanceUltraUploader
 
Are the current computer crime laws sufficient or should the writing of virus...
Are the current computer crime laws sufficient or should the writing of virus...Are the current computer crime laws sufficient or should the writing of virus...
Are the current computer crime laws sufficient or should the writing of virus...UltraUploader
 
Asm based modelling of self-replicating programs
Asm based modelling of self-replicating programsAsm based modelling of self-replicating programs
Asm based modelling of self-replicating programsUltraUploader
 
Applied parallel coordinates for logs and network traffic attack analysis
Applied parallel coordinates for logs and network traffic attack analysisApplied parallel coordinates for logs and network traffic attack analysis
Applied parallel coordinates for logs and network traffic attack analysisUltraUploader
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationUltraUploader
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detectorUltraUploader
 
Artificial immune system against viral attack
Artificial immune system against viral attackArtificial immune system against viral attack
Artificial immune system against viral attackUltraUploader
 
Approaching zero the extraordinary underworld of hackers, phreakers, virus ...
Approaching zero   the extraordinary underworld of hackers, phreakers, virus ...Approaching zero   the extraordinary underworld of hackers, phreakers, virus ...
Approaching zero the extraordinary underworld of hackers, phreakers, virus ...UltraUploader
 
Artificial immune systems and the grand challenge for non classical computation
Artificial immune systems and the grand challenge for non classical computationArtificial immune systems and the grand challenge for non classical computation
Artificial immune systems and the grand challenge for non classical computationUltraUploader
 
Attack of the quantum worms
Attack of the quantum wormsAttack of the quantum worms
Attack of the quantum wormsUltraUploader
 
Last Picture Show:Frank Sinatra
Last Picture Show:Frank Sinatra Last Picture Show:Frank Sinatra
Last Picture Show:Frank Sinatra Mary Lisa Gavenas
 
Historia de la alimentacion
Historia de la alimentacionHistoria de la alimentacion
Historia de la alimentacionPatricia Bernal
 
Automated classification and analysis of internet malware
Automated classification and analysis of internet malwareAutomated classification and analysis of internet malware
Automated classification and analysis of internet malwareUltraUploader
 
Autoimmune computer virus
Autoimmune computer virusAutoimmune computer virus
Autoimmune computer virusUltraUploader
 
Affiliate appsindonesia.co.id
Affiliate appsindonesia.co.idAffiliate appsindonesia.co.id
Affiliate appsindonesia.co.idMuhammad Annur
 

Viewers also liked (19)

Are metamorphic viruses really invincible
Are metamorphic viruses really invincibleAre metamorphic viruses really invincible
Are metamorphic viruses really invincible
 
Are handheld viruses a significant threat
Are handheld viruses a significant threatAre handheld viruses a significant threat
Are handheld viruses a significant threat
 
Attack of the killer virus!
Attack of the killer virus!Attack of the killer virus!
Attack of the killer virus!
 
Approaches to integrated malware detection and avoidance
Approaches to integrated malware detection and avoidanceApproaches to integrated malware detection and avoidance
Approaches to integrated malware detection and avoidance
 
Are the current computer crime laws sufficient or should the writing of virus...
Are the current computer crime laws sufficient or should the writing of virus...Are the current computer crime laws sufficient or should the writing of virus...
Are the current computer crime laws sufficient or should the writing of virus...
 
Asm based modelling of self-replicating programs
Asm based modelling of self-replicating programsAsm based modelling of self-replicating programs
Asm based modelling of self-replicating programs
 
Applied parallel coordinates for logs and network traffic attack analysis
Applied parallel coordinates for logs and network traffic attack analysisApplied parallel coordinates for logs and network traffic attack analysis
Applied parallel coordinates for logs and network traffic attack analysis
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creation
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detector
 
Artificial immune system against viral attack
Artificial immune system against viral attackArtificial immune system against viral attack
Artificial immune system against viral attack
 
Approaching zero the extraordinary underworld of hackers, phreakers, virus ...
Approaching zero   the extraordinary underworld of hackers, phreakers, virus ...Approaching zero   the extraordinary underworld of hackers, phreakers, virus ...
Approaching zero the extraordinary underworld of hackers, phreakers, virus ...
 
Artificial immune systems and the grand challenge for non classical computation
Artificial immune systems and the grand challenge for non classical computationArtificial immune systems and the grand challenge for non classical computation
Artificial immune systems and the grand challenge for non classical computation
 
Attack of the quantum worms
Attack of the quantum wormsAttack of the quantum worms
Attack of the quantum worms
 
Last Picture Show:Frank Sinatra
Last Picture Show:Frank Sinatra Last Picture Show:Frank Sinatra
Last Picture Show:Frank Sinatra
 
Drugeruptions
DrugeruptionsDrugeruptions
Drugeruptions
 
Historia de la alimentacion
Historia de la alimentacionHistoria de la alimentacion
Historia de la alimentacion
 
Automated classification and analysis of internet malware
Automated classification and analysis of internet malwareAutomated classification and analysis of internet malware
Automated classification and analysis of internet malware
 
Autoimmune computer virus
Autoimmune computer virusAutoimmune computer virus
Autoimmune computer virus
 
Affiliate appsindonesia.co.id
Affiliate appsindonesia.co.idAffiliate appsindonesia.co.id
Affiliate appsindonesia.co.id
 

Similar to Are evolutionary rule learning algorithms appropriate for malware detection

rule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsrule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsarteimi
 
Lazy Association Classification
Lazy Association ClassificationLazy Association Classification
Lazy Association ClassificationJason Yang
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
An integrated mechanism for feature selection
An integrated mechanism for feature selectionAn integrated mechanism for feature selection
An integrated mechanism for feature selectionsai kumar
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project finalCraig Cannon
 
Iaetsd protecting privacy preserving for cost effective adaptive actions
Iaetsd protecting  privacy preserving for cost effective adaptive actionsIaetsd protecting  privacy preserving for cost effective adaptive actions
Iaetsd protecting privacy preserving for cost effective adaptive actionsIaetsd Iaetsd
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsinfopapers
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Sound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingSound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingJaguaraci Silva
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityGon-soo Moon
 
Adapting New Data In Intrusion Detection Systems
Adapting New Data In Intrusion Detection SystemsAdapting New Data In Intrusion Detection Systems
Adapting New Data In Intrusion Detection SystemsCSCJournals
 
Final Report
Final ReportFinal Report
Final Reportimu409
 
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017MLconf
 
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...IEEEMEMTECHSTUDENTSPROJECTS
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine LearningUpekha Vandebona
 

Similar to Are evolutionary rule learning algorithms appropriate for malware detection (20)

rule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsrule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systems
 
Lazy Association Classification
Lazy Association ClassificationLazy Association Classification
Lazy Association Classification
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
An integrated mechanism for feature selection
An integrated mechanism for feature selectionAn integrated mechanism for feature selection
An integrated mechanism for feature selection
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project final
 
Iaetsd protecting privacy preserving for cost effective adaptive actions
Iaetsd protecting  privacy preserving for cost effective adaptive actionsIaetsd protecting  privacy preserving for cost effective adaptive actions
Iaetsd protecting privacy preserving for cost effective adaptive actions
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernels
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Sound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingSound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software Testing
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
 
Adapting New Data In Intrusion Detection Systems
Adapting New Data In Intrusion Detection SystemsAdapting New Data In Intrusion Detection Systems
Adapting New Data In Intrusion Detection Systems
 
Ijetr042111
Ijetr042111Ijetr042111
Ijetr042111
 
Final Report
Final ReportFinal Report
Final Report
 
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017
Alex Korbonits, Data Scientist, Remitly, at MLconf Seattle 2017
 
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
 
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
 

More from UltraUploader

01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hackingUltraUploader
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)UltraUploader
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manualUltraUploader
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...UltraUploader
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manualeUltraUploader
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)UltraUploader
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)UltraUploader
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoriaUltraUploader
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorUltraUploader
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applicationsUltraUploader
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worriesUltraUploader
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...UltraUploader
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyUltraUploader
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesUltraUploader
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer virusesUltraUploader
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virologyUltraUploader
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksUltraUploader
 

More from UltraUploader (20)

1 (1)
1 (1)1 (1)
1 (1)
 
01 intro
01 intro01 intro
01 intro
 
01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hacking
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manual
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manuale
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitor
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applications
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worries
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
 
Blast off!
Blast off!Blast off!
Blast off!
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassembly
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer viruses
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer viruses
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virology
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networks
 

Are evolutionary rule learning algorithms appropriate for malware detection

  • 1. Are Evolutionary Rule Learning Algorithms Appropriate for Malware Detection? M. Zubair Shafiq, S. Momina Tabish, Muddassar Farooq Next Generation Intelligent Networks Research Center (nexGIN RC) National University of Computer & Emerging Sciences (FAST-NUCES) Islamabad, 44000, Pakistan {zubair.shafiq,momina.tabish,muddassar.farooq}@nexginrc.org ABSTRACT In this paper, we evaluate the performance of ten well-known evolutionary and non-evolutionary rule learning algorithms. The comparative study is performed on a real-world clas- sification problem of detecting malicious executables. The executable dataset, used in this study, consists of a total of 189 attributes which are statically extracted from the ex- ecutables of Microsoft Windows operating system. In our study, we evaluate the performance of rule learning algo- rithms with respect to four metrics: (1) classification ac- curacy, (2) the number of rules in the developed rule set, (3) the comprehensibility of the generated rules, and (4) the processing overhead of the rule learning process. The results of our study highlight important shortcomings in evolution- ary rule learning classifiers that render them infeasible for deployment in a real-world malware detection system. Categories and Subject Descriptors D.4.6 [Software]: Security and Protection—Invasive soft- ware General Terms Algorithms, Experimentation, Security Keywords Genetics Based Machine Learning, Malware Detection 1. INTRODUCTION The genetic based machine learning systems, more com- monly known as GBML, have become significantly more so- phisticated compared with primitive versions of the learning classifier systems (LCS) proposed by John Holland in 1970s. A number of LCS variants exist that include evolutionary rule learners, evolutionary neural networks and other hy- brid schemes. But evolutionary rule learners have been the most popular amongst all. They can be further subdivided into Michigan style, Pittsburgh style, anticipatory, and it- erative rule learners [1]. Most of the recent work in LCSs is focused on classical Michigan- and Pittsburgh-style evo- lutionary rule learners that include but are not limited to XCS, UCS, GAssist and GALE. Copyright is held by the author/owner(s). GECCO’09, July 8–12, 2009, Montréal Québec, Canada. ACM 978-1-60558-325-9/09/07. In this paper, we evaluate several well-known evolutionary rule learning classification algorithms (XCS, UCS, GAssist- ADI, GAssist-Intervalar, SLAVE) using a real-world clas- sification problem of malware detection. We also compare these evolutionary rule learning algorithms with five other well-knwon non-evolutionary rule learning algorithms: RIP- PER, SLIPPER, PART, C4.5 rules, RIONA. The malware detection problem has several stringent constraints other than classification accuracy, such as comprehensibility of the developed solution (for malware forensic experts) and processing overheads (for realtime deployment). Therefore, the malware detection problem is relatively more challeng- ing because accuracy and efficiency of classification must be optimized simultaneously. The executable dataset, used in this study, consists of 11, 786 executable files for the Microsoft Windows operating system in Portable Executable (PE) format [3]. A total of 189 attributes are statically extracted from the executables. The executable dataset consists of two types of executables: benign and malicious. We have collected 1, 447 benign PE files from the local area network of our virology lab. We have obtained 10, 339 malicious executables from a publicly avail- able malware database called ‘VX Heavens Virus Collection’ [4]. In order to undertake a comprehensive study, we use four performance metrics for comparison: (1) classification accu- racy, (2) the number of rules, (3) comprehensibility of the rules, and (4) processing overhead. To factor out implemen- tation related bias in our study, we use the implementations of rule learning algorithms provided in a unified framework called Knowledge Extraction based on Evolutionary Learn- ing (KEEL) [2]. 2. RESULTS & CONCLUSION In this section, we provide the results of our experiments and relevant discussions. We have used the default para- meters of all algorithms as provided in KEEL, unless stated otherwise. We compare the performance of classifiers with the help of four performance metrics: 2.1 Classification Accuracy In a typical two-class problems, such as malicious exe- cutable detection, the classification decision of an algorithm may fall into one of the following four categories: (1) true positive (TP), (2) true negative (TN), (3) false positive (FP), and (4) false negative (FN). A suitable metric for quantify- ing classification accuracy of an algorithm, also used in this study, is accuracy which is defined as:
  • 2. Table 1: Classification accuracy of all algorithms Algorithm Average Number Execution Accuracy of rules time (sec) Evolutionary Rule Based Algorithms XCS 0.9550± 0.0057 10, 000.0 109.0 UCS 0.9847± 0.0034 9, 652.6 16.3 GAssist-ADI 0.9929±0.0013 2.6 1, 177.2 GAssist-Intervalar 0.9873± 0.0024 4.2 4, 060.9 SLAVE 0.9917±0.0013 3.3 775.1 Non-Evolutionary Rule Based Algorithms RIPPER 0.9961± 0.0033 6.0 8.0 SLIPPER 0.9975± 0.0023 57.0 29.4 PART 0.9956± 0.0028 6.3 51.8 C4.5 rules 0.9962± 0.0031 7.4 37.0 RIONA 0.9899± 0.0052 - 2680.5 accuracy = TP + TN TP + TN + FP + FN Table 1 shows the classification accuracy of all algorithms (both evolutionary and non-evolutionary). We represent the best evolutionary and non-evolutionary classifier with a bold value. The best evolutionary rule learning algorithm is GAssit-ADI, closely followed by SLAVE which also has the average accuracy of more than 0.99. GAssist-Intervalar and UCS have accuracies in the range of approximately 0.98. XCS provides the worst accuracy amongst all classification algorithms used in our study. SLIPPER outperforms rest of the non-evolutionary rule learning algorithms. It is closely followed by C4.5 rules, RIPPER and PART respectively. The relatively worst classifier among non-evolutionary algo- rithm is RIONA that provides an accuracy of approximately 0.99. It is evident from Table 1 that non-evolutionary rule learning algorithms clearly outperform their evolutionary counterparts in terms of accuracy. 2.2 Number of Rules The number of rules developed by an algorithm are de- pendent on the various input parameters. Sometimes, the maximum possible number of rules is explicitly limited via an input parameter. Table 1 shows that the number of rules in XCS and UCS are limited by the MaxPopSize (=10, 000) parameter as they try to overfit. In comparison, other evo- lutionary rule learning algorithms utilize significantly fewer rules for classifying a malware. Non-evolutionary rule learn- ing algorithms also generate significantly less number of rules. However, SLIPPER appears to be the only exception, which generates more than fifty rules on the average. 2.3 Comprehensibility of Rules The comprehensibility of generated rules is another im- portant metric as it provides great help to the ‘malware forensic experts’ in understanding the behavior and charac- teristics of a given malware. The comprehensibility of rules is a subjective term so we quantify it by using coarse grain Low, Medium, and High categories. Both XCS and UCS develop conjunctive IF-THEN rules containing specified intervals for every attribute. Moreover, the number of rules in the final rule set are very large (see Table 1). Due to these obvious shortcomings, we say that the comprehensibility of the rules generated by XCS and UCS is Low. GAssist-ADI mostly develops rules by using single- sided ranges and has default rules as well. The comprehensi- bility of rules generated by GAssist-ADI can be categorized as High. GAssist-Intervalar, similar to XCS and UCS, gener- ates rules that contain specified intervals for every attribute. However, the final rule set just contains only a few rules. So the comprehensibility of the generated rules is categorized as Medium. SLAVE uses fixed intervals (such as L1,L2,L3,L4) to generate the fuzzy rules. It also assigns weights to the individual rules. Overall, the comprehensibility of the rules generated by SLAVE can be ranked as Medium. The rules generated by all non-evolutionary rule learning algorithms are not only simpler to comprehend but are also smaller in number. Therefore, the comprehensibility of the rules gen- erated by RIPPER, SLIPPER, PART and C4.5 rules can be categorized as High. 2.4 Processing Overheads The processing overheads are reported in Table 1 as the average time (in seconds) required for training and test- ing per fold of the 10-fold cross validation process. It is evident from Table 1 that the processing overheads of all evolutionary rule learning algorithms, except UCS, are sig- nificantly large compared with non-evolutionary classifiers. Malware detection systems are deployed in real operating systems; therefore, the processing overhead is a crucial met- ric. GAssist-ADI and GAssist-Intervalar have very large processing overheads. The processing overheads of XCS and SLAVE are also high. UCS is the only evolutionary rule learning algorithm which has a processing overhead compa- rable with non-evolutionary algorithms. All non-evolutionary rule learning algorithms, except RI- ONA, have acceptable processing overheads. The processing overhead of RIONA is large because it uses a combination of instance based learning and rule induction. For every given test instance, it first selects the neighboring instances simi- lar to the nearest neighbor algorithm. After selection, it de- velops rule sets through rule induction. The complexity in- curred due to exhaustive search of the neighboring instances for every test instance is a major performance bottleneck. 2.5 Conclusions In this paper, we have compared ten rule learning algo- rithms for a real-world problem of executable malware de- tection. The results of our study show that non-evolutionary rule learning algorithms clearly outperform evolutionary rule learning algorithms for every performance metric. 3. REFERENCES [1] J.H. Holland, L.B. Booker, M. Colombetti, M. Dorigo, D.E. Goldberg, S. Forrest, R.L. Riolo, R.E. Smith, P.L. Lanzi, W. Stolzmann, S.W. Wilson, “What Is a Learning Classifier System?”, Internatinoal Workshop on Learning Classifier Systems (IWLCS), Volume 1813 of Lecture Notes in Artificial Intelligence, pp. 3-32, Springer, 2000. [2] J. Alcala-Fdez, L. Sanchez, S. Garcia, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernandez, F. Herrera, “KEEL: a software tool to assess evolutionary algorithms for data mining problems”, Soft Computing, Volume 13, pp. 307-318, Springer, 2009. [3] Microsoft Portable Executable and Common Object File Format Specification, Windows Hardware Developer Central, Updated March 2008. [4] VX Heavens Virus Collection, VX Heavens website, available at http://vx.netlux.org.