The document summarizes research on analyzing how existing anti-virus software classifies malware. It finds that anti-virus products provide labels for malware that are inconsistent across products, incomplete in covering all malware, and lack concise semantics. To address these limitations, the research proposes a new technique for classifying malware based on its behavior and system changes, and automatically grouping similar behaviors. It evaluates the approach using large and diverse malware datasets.
A framework to detect novel computer viruses via system calls - UltraUploader
This document describes a framework for detecting email viruses based on system calls. It involves injecting DLLs to monitor and log system calls from an email client. The framework includes a training period where it is exposed to known viruses to derive malicious system calls, which are stored in a database. Normal email usage is also tested to identify unique virus-related system calls. This allows detection of new viruses based on abnormal system calls, without needing pre-existing virus signatures.
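The training idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: traces are modeled as lists of call names, the call names in the usage note are hypothetical, and the set-difference rule is a simplification rather than the framework's actual database logic.

```python
def derive_virus_only_calls(virus_traces, normal_traces):
    """Return calls seen in virus traces but never in normal traces.

    Each trace is a list of system call names (an assumed representation).
    """
    virus_calls = set().union(*virus_traces)
    normal_calls = set().union(*normal_traces)
    return virus_calls - normal_calls


def is_suspicious(trace, virus_only_calls):
    """Flag a new trace if it contains any call unique to virus activity."""
    return any(call in virus_only_calls for call in trace)
```

For example, with hypothetical traces `[["CreateFile", "WriteFile", "RegSetValue"], ["CreateFile", "SendMail"]]` for viruses and `[["CreateFile", "ReadFile", "SendMail"]]` for normal use, the derived set is `{"WriteFile", "RegSetValue"}`, so a later trace containing `RegSetValue` would be flagged even without a signature for that specific virus.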
Optimised malware detection in digital forensics - IJNSA Journal
On the Internet, malware is one of the most serious threats to system security. Most complex issues and
problems on any system are caused by malware and spam. Networks and systems can be accessed and
compromised by botnet malware, which compromises other systems through coordinated
attacks. Such malware uses anti-forensic techniques to avoid detection and investigation. To protect systems
from this malicious activity, a new framework is required that aims to develop an optimised
technique for malware detection. Hence, this paper demonstrates new approaches to malware
analysis in forensic investigations and discusses how such a framework may be developed.
A feature selection and evaluation scheme for computer virus detection - UltraUploader
This document proposes a feature selection and evaluation scheme for computer virus detection using machine learning. It presents an exhaustive search method to identify generic n-gram features from virus code, selecting features that meet minimum support thresholds within and across virus families. A hierarchical feature selection process is described to obtain concise yet representative features. The evaluation method aims to simulate detecting new virus outbreaks by testing the classifier on previously unseen viruses from the same families not in the training set.
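A minimal sketch of support-based n-gram selection follows. The function names, the way support is measured (fraction of samples in a family that contain the n-gram), and the two thresholds are assumptions made for illustration; the scheme summarized above may define them differently.

```python
from collections import Counter


def ngrams(data: bytes, n: int) -> set:
    """Set of distinct byte n-grams occurring in one sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def select_features(families, n, min_within, min_across):
    """Keep n-grams that are frequent within families and shared across them.

    families:   dict mapping family name -> list of samples (bytes)
    min_within: minimum fraction of a family's samples containing the n-gram
    min_across: minimum number of families in which that holds
    """
    family_hits = Counter()
    for samples in families.values():
        if not samples:
            continue
        counts = Counter()
        for sample in samples:
            counts.update(ngrams(sample, n))
        for gram, count in counts.items():
            if count / len(samples) >= min_within:
                family_hits[gram] += 1
    return {gram for gram, hits in family_hits.items() if hits >= min_across}
```

To mimic the outbreak-style evaluation, the selected features would be computed from training samples only and then checked against held-out samples of the same families.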
A SURVEY ON MALWARE DETECTION AND ANALYSIS TOOLS - IJNSA Journal
This document summarizes a survey paper on malware detection and analysis tools. It provides an overview of different types of malware like viruses, worms, Trojans, rootkits, spyware and keyloggers. It describes techniques for malware analysis, including static analysis which examines code without execution, and dynamic analysis which analyzes behavior during execution. It also lists some limitations of static analysis and the need for dynamic analysis. Finally, it discusses various tools available for malware detection, analysis, reverse engineering and debugging.
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa... - CSCJournals
Some malware is sophisticated, using polymorphic techniques such as self-mutation and evasion of emulation-based analysis. Most anti-malware techniques are overwhelmed by polymorphic malware threats that self-mutate into different variants at every attack. This research aims to contribute to the detection of malicious code, especially polymorphic malware, by utilizing advanced static and advanced dynamic analyses to extract more informative key features of a malware sample through code analysis, memory analysis and behavioral analysis. A correlation-based feature selection algorithm will be used to transform features, i.e. to filter and select optimal and relevant features. A machine learning technique called K-Nearest Neighbor (K-NN) will be used for classification and detection of polymorphic malware. Evaluation of results will be based on the following metrics: True Positive Rate (TPR), False Positive Rate (FPR) and the overall detection accuracy of experiments.
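The three evaluation metrics named above have standard definitions, which can be written out as follows. The label encoding (1 for malware, 0 for benign) and the function name are assumptions for this sketch.

```python
def detection_rates(y_true, y_pred):
    """Return (TPR, FPR, accuracy) for binary labels, 1 = malware, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # malware flagged as malware
    fn = sum(1 for t, p in pairs if t and not p)      # malware missed
    fp = sum(1 for t, p in pairs if not t and p)      # benign flagged as malware
    tn = sum(1 for t, p in pairs if not t and not p)  # benign passed as benign
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return tpr, fpr, accuracy
```

TPR is computed only over the malware samples and FPR only over the benign ones, so a classifier can show high accuracy on an imbalanced dataset while still having a poor FPR.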
Analysis of Malware Infected Systems & Classification with Gradient-boosted T... - Darshan Gorasiya
Analysis of Malware Infected Systems with MapReduce, Pig, Hive, SparkSQL & Classification with Spark MLlib Gradient-boosted Tree on Big Data Platform (Hadoop)
Malware Risk Analysis on the Campus Network with Bayesian Belief Network - IJNSA Journal
This document discusses using a Bayesian Belief Network (BBN) to analyze malware risk on a university campus network. It begins by introducing the campus network monitoring tools and SIR epidemiological model used to model malware propagation. It then provides background on BBN principles, including defining nodes, conditional probabilities, and using the network to compute joint probabilities. The document proposes applying a BBN to assess malware prevalence risk by relating threat, vulnerability, and cost impact on network assets. It aims to provide understandable risk assessments to inform decision making.
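As a rough illustration of how a belief network yields joint probabilities, the sketch below uses a three-node network in which a threat node and a vulnerability node are parents of a compromise node. The node names and every probability value are hypothetical and are not taken from the document; only the chain-rule factorization is standard.

```python
from itertools import product

# Hypothetical prior probabilities for the two parent nodes.
P_THREAT = {True: 0.3, False: 0.7}
P_VULN = {True: 0.2, False: 0.8}

# Hypothetical conditional table: P(compromise = True | threat, vulnerability).
P_COMPROMISE = {
    (True, True): 0.9,
    (True, False): 0.1,
    (False, True): 0.05,
    (False, False): 0.0,
}


def joint(threat, vuln, compromise):
    """P(T, V, C) = P(T) * P(V) * P(C | T, V) for this network structure."""
    p_c = P_COMPROMISE[(threat, vuln)]
    return P_THREAT[threat] * P_VULN[vuln] * (p_c if compromise else 1 - p_c)


def marginal_compromise():
    """P(C = True), obtained by summing the joint over both parents."""
    return sum(joint(t, v, True) for t, v in product([True, False], repeat=2))
```

With these made-up numbers the marginal works out to 0.054 + 0.024 + 0.007 + 0 = 0.085.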
COMPARISON OF MALWARE CLASSIFICATION METHODS USING CONVOLUTIONAL NEURAL NETWO... - IJNSA Journal
Malicious software is constantly being developed and improved, so detection and classification of malware is an ever-evolving problem. Since traditional malware detection techniques fail to detect new/unknown malware, machine learning algorithms have been used to overcome this disadvantage. We present a Convolutional Neural Network (CNN) for malware type classification based on API (Application Program Interface) calls. This research uses a database of 7107 instances of API call streams and 8 different malware types: Adware, Backdoor, Downloader, Dropper, Spyware, Trojan, Virus, Worm. We used a 1-Dimensional CNN by mapping API calls as categorical and term frequency-inverse document frequency (TF-IDF) vectors and compared the results to other classification techniques. The proposed 1-D CNN outperformed other classification techniques with 91% overall accuracy for both categorical and TF-IDF vectors.
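To make the TF-IDF mapping concrete, here is a small sketch that turns API call streams into TF-IDF vectors. It uses one common variant (term frequency as count divided by stream length, and idf as log(N / df) without smoothing); the paper's exact weighting is not stated in the summary, and the API call names in the usage note are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(call_streams):
    """Map each API call stream to a TF-IDF vector over a shared vocabulary."""
    vocab = sorted({call for stream in call_streams for call in stream})
    n_docs = len(call_streams)
    # Document frequency: number of streams in which each call appears.
    df = Counter(call for stream in call_streams for call in set(stream))
    vectors = []
    for stream in call_streams:
        counts = Counter(stream)
        total = len(stream) or 1
        vectors.append(
            [(counts[call] / total) * math.log(n_docs / df[call]) for call in vocab]
        )
    return vocab, vectors
```

For the streams `["OpenProcess", "WriteProcessMemory", "OpenProcess"]` and `["OpenProcess", "RegSetValueExA"]`, the call `OpenProcess` appears in every stream and therefore gets weight 0, while the calls unique to one stream carry all the signal. Vectors like these could then be fed to a 1-D CNN or any other classifier.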
Malware Detection Module using Machine Learning Algorithms to Assist in Centr... - IJNSA Journal
Malicious software is abundant in a world of innumerable computer users, who are constantly faced with these threats from various sources like the internet, local networks and portable drives. Malware ranges from low to high risk and can cause systems to function incorrectly, steal data and even crash. Malware may be executable or system library files in the form of viruses, worms and Trojans, all aimed at breaching the security of the system and compromising user privacy. Typically, anti-virus software is based on a signature definition system which keeps updating from the internet and thus keeps track of known viruses. While this may be sufficient for home users, a security risk from a new virus could threaten an entire enterprise network. This paper proposes a new and more sophisticated antivirus engine that can not only scan files, but also build knowledge and detect files as potential viruses. This is done by extracting system API calls made by various normal and harmful executables, and using machine learning algorithms to classify and hence rank files on a scale of security risk. While such a system is processor heavy, it is very effective when used centrally to protect an enterprise network, which may be more prone to such threats.
Malware classification using Machine Learning - Japneet Singh
Uses examples from the book "Malware Data Science" to explain how AV companies use machine learning to identify malware. Also refers to the open-source project "Ember", which provides a data set and Python code to train and classify malware.
A zero-day (also known as 0-day) vulnerability is a computer-software vulnerability that is unknown to those who would be interested in mitigating the vulnerability.
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS - IJNSA Journal
Malware writers have employed various obfuscation and polymorphism techniques to thwart static analysis
approaches and bypass antivirus tools. Dynamic analysis techniques, however, have essentially
overcome these deceits by observing the actual behaviour of the code execution. In this regard, various
methods, techniques and tools have been proposed. However, because of the diverse concepts and
strategies used in the implementation of these methods and tools, security researchers and malware
analysts find it difficult to select the required optimum tool to investigate the behaviour of a malware and to
contain the associated risk for their study. Focusing on two dynamic analysis techniques: Function Call
monitoring and Information Flow Tracking, this paper presents a comparison framework for dynamic
malware analysis tools. The framework will assist the researchers and analysts to recognize the tool’s
implementation strategy, analysis approach, system-wide analysis support and its overall handling of
binaries, helping them to select a suitable and effective one for their study and analysis.
MALWARE DETECTION USING MACHINE LEARNING ALGORITHMS AND REVERSE ENGINEERING O... - IJNSA Journal
This research paper focuses on mobile application malware detection through reverse engineering of Android Java code and the use of machine learning algorithms. The malicious software characteristics were identified based on a collected set of 1958 applications (including 996 malware applications). During the research a unique set of features was chosen, then three attribute selection algorithms and five classification algorithms (Random Forest, K Nearest Neighbors, SVM, Naive Bayes and Logistic Regression) were examined to choose the algorithms that would provide the most effective rate of malware detection.
MINING PATTERNS OF SEQUENTIAL MALICIOUS APIS TO DETECT MALWARE - IJNSA Journal
In the era of information technology and a connected world, detecting malware has been a major security concern for individuals, companies and even states. The new generation of malware samples, upgraded with advanced protection mechanisms such as packing and obfuscation, frustrates anti-virus solutions. API call analysis is used to identify suspicious behavior because API calls describe a program's functionality well. In this paper, we propose an effective and efficient malware detection method that uses a sequential pattern mining algorithm to discover representative and discriminative API call patterns. Then, we apply three machine learning algorithms to classify malware samples. Based on the experimental results, the proposed method achieves favorable results, with a 0.999 F-measure on a dataset including 8152 malware samples belonging to 16 families and 523 benign samples.
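The notion of a representative and discriminative API call pattern can be illustrated with a deliberately simplified stand-in for sequential pattern mining. The sketch below only considers ordered pairs of calls (gaps allowed) and keeps those with high support in malware and low support in benign samples; the thresholds, function names and the restriction to length-2 patterns are assumptions, and this is not the algorithm used in the paper.

```python
from itertools import combinations


def ordered_pairs(sequence):
    """All length-2 subsequences (a before b, not necessarily adjacent)."""
    return set(combinations(sequence, 2))


def support(pattern_sets, pattern):
    """Fraction of samples whose pair set contains the pattern."""
    if not pattern_sets:
        return 0.0
    return sum(pattern in s for s in pattern_sets) / len(pattern_sets)


def discriminative_patterns(malware, benign, min_malware, max_benign):
    """Patterns frequent in malware sequences and rare in benign ones."""
    mal_sets = [ordered_pairs(seq) for seq in malware]
    ben_sets = [ordered_pairs(seq) for seq in benign]
    candidates = set().union(*mal_sets)
    return {
        p for p in candidates
        if support(mal_sets, p) >= min_malware and support(ben_sets, p) <= max_benign
    }
```

The presence or absence of each surviving pattern in a sample can then serve as a binary feature for a downstream classifier.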
A memory symptom based virus detection approach - UltraUploader
This document proposes a new memory symptom-based approach for virus detection. It summarizes that traditional anti-virus software relies on pattern matching which has problems like damage during the zero-day period before new virus patterns are added. The proposed approach detects viruses based on their memory usage symptoms during execution, rather than pattern matching. An experiment analyzed the memory usage curves of 109 test programs and was able to detect viruses with 97% true positive rate and only 13% false positive rate, showing the effectiveness of this symptom-based approach.
This document proposes a data mining framework to automatically detect new malicious executables. It extracts features from binaries and uses three data mining classifiers trained on these features: a rule learner, probabilistic classifier, and multi-classifier system. When evaluated on a test set, the framework detects 97.76% of new malicious binaries, more than doubling the detection rate of a signature-based method.
Contending Malware Threat using Hybrid Security Model - IRJET Journal
The document proposes a hybrid security model to combat malware threats across different types of IT systems. It analyzes positive and negative security models and their advantages and disadvantages. A hybrid model is proposed that uses a combination of whitelisting, blacklisting, firewalls, antivirus software and other tools depending on the system type. For example, corporate systems would use application whitelisting to only allow approved enterprise apps, while home systems rely more on antivirus and firewalls for flexibility. The goal is to provide effective security tailored to each system's environment and business needs.
Basic survey on malware analysis, tools and techniques - ijcsa
The term malware stands for malicious software. It is a program installed on a system without the
knowledge of the system's owner. It is typically installed by a third party with the intention of stealing
private data from the system or simply playing pranks. This in turn threatens the security of computers,
which people use in day-to-day life for necessities such as education,
communication, hospitals, banking, entertainment, etc. Traditional techniques such as Antivirus Scanners (AVS)
and firewalls are used to detect and defend against malware. But today malware writers are one
step ahead of malware detectors. Day by day they write new malware, which poses a great
challenge for malware detectors. This paper focuses on a basic study of malware and the various detection
techniques that can be used to detect it.
The document discusses an automated system called HoneyMonkey that detects exploits and malware on websites through the use of client-side honeypots. HoneyMonkey utilizes "monkey programs" that run browsers within virtual machines to mimic human web browsing. Any files created or changes made outside the browser are flagged as potential exploits. The system also tracks redirections and popups to analyze vulnerability exploitation and malware installation chains. Evaluation of HoneyMonkey showed it could successfully classify websites as excellent, bad or okay based on detected exploits and redirection behavior.
The WannaCry ransomware virus infected over 200,000 organizations in 150 countries, crippling many hospitals and other organizations. It exploited a vulnerability in Windows to encrypt files and demand ransom payments in bitcoin. While a "kill switch" was discovered that stopped the spread, many systems already infected could not be recovered without paying ransom. It highlighted the need to keep systems updated and have backups to prevent future attacks.
IRJET- Android Malware Detection using Machine Learning - IRJET Journal
This document discusses using machine learning algorithms to detect Android malware. It aims to extract features from Android applications (APKs) and train machine learning models to classify APKs as malware or benign. The proposed approach extracts features from an APK's manifest file and decompiled code to identify permissions, URLs, API calls, and other indicators. Random forest classifiers are trained on a dataset of benign and malicious APKs to detect known malware families. The models can classify new APKs as either malware or benign, and if malware, identify the specific malware family. The approach aims to detect malware with high accuracy while reducing analysis time by processing multiple APKs in parallel.
The document discusses how schools are combating computer viruses. It describes how viruses spread rapidly through email and software vulnerabilities. Schools are fighting back with updated antivirus software, tech-savvy staff who monitor networks for infections, and blocking common file extensions used in viruses. While some schools experienced minor issues, most avoided serious problems. Authorities are working to prosecute virus authors as a deterrent to other hackers.
This document presents a method for detecting spyware using data mining and decision tree algorithms. Binary features are extracted from executable files using n-grams and feature reduction is applied. The reduced features are used to generate ARFF files for training a decision tree classifier. The decision tree is able to classify unknown files as spyware or benign based on their n-gram patterns. The proposed method aims to detect both known and new, unseen spyware files unlike signature-based detection methods. A prototype application is developed with a graphical user interface to scan for and detect spyware files on a system.
Given at TRISC 2010, Grapevine, Texas.
http://www.trisc.org/speakers/aditya_sood/#p
The talk sheds light on new trends in web-based malware. Technology and insecurity go hand in hand. With the advent of new attacks and techniques, the distribution of malware through the web has increased tremendously. Browser-based exploits, mainly against Internet Explorer, have given birth to a new world of malware infection. Attackers spread malware by exploiting vulnerabilities and using drive-by downloads. Infection strategies adopted by attackers include malware distribution through IFRAME injections and search engine optimization. To understand the intrinsic behavior of web-based malware, analysis of the logic behind it is required. It is necessary to dissect this malware from bottom to top in order to control its devastating behavior. The talk will cover structured methodologies and demonstrate the static, dynamic and behavioral analysis of web malware, including PCAP analytics. Demonstrations will show the necessity of web malware analysis.
The document discusses automatic deobfuscation of binary code. It presents a local semantic analysis approach that rewrites binary code in a simpler form without relying on manual identification of obfuscation patterns. The approach uses compiler optimization techniques like constant propagation, folding, and stack optimization on virtual machine handler functions to drastically simplify the obfuscated code. It is able to reduce handler functions from 100-200 instructions to at most 10 instructions within a single basic block.
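The effect of compiler-style simplification on a single basic block can be shown with a toy example. The instruction format below (tuples of opcode, destination and sources over named registers) is a hypothetical mini-language, not x86 and not the tool's intermediate representation, and the dead-write removal pass is added here for illustration alongside the constant propagation and folding mentioned above.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "xor": operator.xor}


def fold_constants(block):
    """Propagate known constants forward and fold fully constant operations."""
    known = {}  # register name -> constant value
    out = []
    for op, dst, *srcs in block:
        vals = [known.get(s, s) if isinstance(s, str) else s for s in srcs]
        if op == "mov":
            result = vals[0]
        elif all(isinstance(v, int) for v in vals):
            result = OPS[op](*vals)
        else:
            result = None
        if isinstance(result, int):
            known[dst] = result
            out.append(("mov", dst, result))
        else:
            known.pop(dst, None)  # dst no longer holds a known constant
            out.append((op, dst, *vals))
    return out


def drop_dead_writes(block, live_out):
    """Remove writes to registers that are never read before the block ends."""
    live = set(live_out)
    kept = []
    for op, dst, *srcs in reversed(block):
        if dst in live:
            live.discard(dst)
            live.update(s for s in srcs if isinstance(s, str))
            kept.append((op, dst, *srcs))
    return kept[::-1]
```

Running both passes on `[("mov", "a", 5), ("mov", "b", 7), ("add", "c", "a", "b"), ("xor", "r", "c", "x")]` with `r` as the only live-out register leaves the single instruction `("xor", "r", 12, "x")`, which mirrors on a small scale how a long obfuscated handler can collapse to a few instructions.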
Bird binary interpretation using runtime disassembly - UltraUploader
The document describes BIRD (Binary Interpretation using Runtime Disassembly), a binary analysis and instrumentation infrastructure for the Windows/x86 platform. BIRD combines static and dynamic disassembly to guarantee that every instruction in a binary is analyzed before execution. It provides services to convert binary code to assembly and insert instrumentation code without affecting program semantics. The prototype took 12 student months to develop and can successfully analyze applications like Microsoft Office, Internet Explorer, and IIS with low overhead of below 4%.
Accurately detecting source code of attacks that increase privilege - UltraUploader
The document discusses developing a system to detect source code for attacks that increase privilege before they are executed. The system separates incoming data into categories like C code or shell code. Features are extracted from each sample and used to estimate if it contains attack code. The system has been evaluated on large databases of normal and attack software written by many authors, with results showing accurate detection of attack code.
This document summarizes the authors' experience over two years fuzzing VoIP devices to discover vulnerabilities. They used their in-house tool KIF to conduct stateful protocol fuzzing on a variety of VoIP equipment. The testing uncovered many vulnerabilities related to weak input validation, including buffer overflows and format string issues. Some vulnerabilities allowed compromising internal networks by exploiting unfiltered web interfaces on VoIP phones. The authors disclosed vulnerabilities responsibly and provided mitigation techniques.
Malware Detection Module using Machine Learning Algorithms to Assist in Centr...IJNSA Journal
Malicious software is abundant in a world of innumerable computer users, who are constantly faced withthese threats from various sources like the internet, local networks and portable drives. Malware is potentially low to high risk and can cause systems to function incorrectly, steal data and even crash. Malware may be executable or system library files in the form of viruses, worms, Trojans, all aimed at breaching the security of the system and compromising user privacy. Typically, anti-virus software is based on a signature definition system which keeps updating from the internet and thus keeping track of known viruses. While this may be sufficient for home-users, a security risk from a new virus could threaten an entire enterprise network. This paper proposes a new and more sophisticated antivirus engine that can not only scan files, but also build knowledge and detect files as potential viruses. This is done by extracting system API calls made by various normal and harmful executable, and using machine learning algorithms to classify and hence, rank files on a scale of security risk. While such a system is processor heavy, it is very effective when used centrally to protect an enterprise network which maybe more prone to such threats.
Malware classification using Machine LearningJapneet Singh
Uses examples from book titled "Malware Data Science" to explain how AV companies use Machine learning to identify malware. Also, refers to open-source project "Ember" which provides a data set and python code to train and classify malware.
A zero-day (also known as 0-day) vulnerability is a computer-software vulnerability that is unknown to those who would be interested in mitigating the vulnerability.
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSIJNSA Journal
Malware writers have employed various obfuscation and polymorphism techniques to thwart static analysis
approaches and bypassing antivirus tools. Dynamic analysis techniques, however, have essentially
overcome these deceits by observing the actual behaviour of the code execution. In this regard, various
methods, techniques and tools have been proposed. However, because of the diverse concepts and
strategies used in the implementation of these methods and tools, security researchers and malware
analysts find it difficult to select the required optimum tool to investigate the behaviour of a malware and to
contain the associated risk for their study. Focusing on two dynamic analysis techniques: Function Call
monitoring and Information Flow Tracking, this paper presents a comparison framework for dynamic
malware analysis tools. The framework will assist the researchers and analysts to recognize the tool’s
implementation strategy, analysis approach, system-wide analysis support and its overall handling of
binaries, helping them to select a suitable and effective one for their study and analysis.
MALWARE DETECTION USING MACHINE LEARNING ALGORITHMS AND REVERSE ENGINEERING O...IJNSA Journal
This research paper is focused on the issue of mobile application malware detection by Reverse Engineering of Android java code and use of Machine Learning algorithms. The malicious software characteristics were identified based on a collected set of total number of 1958 applications (including 996 malware applications). During research a unique set of features was chosen, then three attribute selection algorithms and five classification algorithms (Random Forest, K Nearest Neighbors, SVM, Nave Bayes and Logistic Regression) were examined to choose algorithms that would provide the most effective rate of malware detection.
MINING PATTERNS OF SEQUENTIAL MALICIOUS APIS TO DETECT MALWAREIJNSA Journal
In the era of information technology and connected world, detecting malware has been a major security concern for individuals, companies and even for states. The New generation of malware samples upgraded with advanced protection mechanism such as packing, and obfuscation frustrate anti-virus solutions. API call analysis is used to identify suspicious malicious behavior thanks to its description capability of a software functionality. In this paper, we propose an effective and efficient malware detection method that uses sequential pattern mining algorithm to discover representative and discriminative API call patterns. Then, we apply three machine learning algorithms to classify malware samples. Based on the experimental results, the proposed method assures favorable results with 0.999 F-measure on a dataset including 8152 malware samples belonging to 16 families and 523 benign samples.
A memory symptom based virus detection approachUltraUploader
This document proposes a new memory symptom-based approach for virus detection. It summarizes that traditional anti-virus software relies on pattern matching which has problems like damage during the zero-day period before new virus patterns are added. The proposed approach detects viruses based on their memory usage symptoms during execution, rather than pattern matching. An experiment analyzed the memory usage curves of 109 test programs and was able to detect viruses with 97% true positive rate and only 13% false positive rate, showing the effectiveness of this symptom-based approach.
This document proposes a data mining framework to automatically detect new malicious executables. It extracts features from binaries and uses three data mining classifiers trained on these features: a rule learner, probabilistic classifier, and multi-classifier system. When evaluated on a test set, the framework detects 97.76% of new malicious binaries, more than doubling the detection rate of a signature-based method.
Contending Malware Threat using Hybrid Security ModelIRJET Journal
The document proposes a hybrid security model to combat malware threats across different types of IT systems. It analyzes positive and negative security models and their advantages and disadvantages. A hybrid model is proposed that uses a combination of whitelisting, blacklisting, firewalls, antivirus software and other tools depending on the system type. For example, corporate systems would use application whitelisting to only allow approved enterprise apps, while home systems rely more on antivirus and firewalls for flexibility. The goal is to provide effective security tailored to each system's environment and business needs.
Basic survey on malware analysis, tools and techniquesijcsa
The term malware stands for malicious software. It is a program installed on a system without the
knowledge of the system's owner. It is typically installed by a third party with the intention of stealing
private data from the system or simply playing pranks. This in turn threatens the security of computers,
which people use in day-to-day life for necessities such as education,
communication, hospitals, banking and entertainment. Various traditional techniques are used to detect
and defend against malware, such as antivirus scanners (AVS) and firewalls. But today malware writers are one
step ahead of malware detectors. Day by day they write new malware, which poses a great
challenge for malware detectors. This paper focuses on a basic study of malware and the various detection
techniques that can be used to detect it.
The document discusses an automated system called HoneyMonkey that detects exploits and malware on websites through the use of client-side honeypots. HoneyMonkey utilizes "monkey programs" that run browsers within virtual machines to mimic human web browsing. Any files created or changes made outside the browser are flagged as potential exploits. The system also tracks redirections and popups to analyze vulnerability exploitation and malware installation chains. Evaluation of HoneyMonkey showed it could successfully classify websites as excellent, bad or okay based on detected exploits and redirection behavior.
The WannaCry ransomware virus infected over 200,000 organizations in 150 countries, crippling many hospitals and other organizations. It exploited a vulnerability in Windows to encrypt files and demand ransom payments in bitcoin. While a "kill switch" was discovered that stopped the spread, many systems already infected could not be recovered without paying ransom. It highlighted the need to keep systems updated and have backups to prevent future attacks.
IRJET- Android Malware Detection using Machine LearningIRJET Journal
This document discusses using machine learning algorithms to detect Android malware. It aims to extract features from Android applications (APKs) and train machine learning models to classify APKs as malware or benign. The proposed approach extracts features from an APK's manifest file and decompiled code to identify permissions, URLs, API calls, and other indicators. Random forest classifiers are trained on a dataset of benign and malicious APKs to detect known malware families. The models can classify new APKs as either malware or benign, and if malware, identify the specific malware family. The approach aims to detect malware with high accuracy while reducing analysis time by processing multiple APKs in parallel.
The document discusses how schools are combating computer viruses. It describes how viruses spread rapidly through email and software vulnerabilities. Schools are fighting back with updated antivirus software, tech-savvy staff who monitor networks for infections, and blocking common file extensions used in viruses. While some schools experienced minor issues, most avoided serious problems. Authorities are working to prosecute virus authors as a deterrent to other hackers.
This document presents a method for detecting spyware using data mining and decision tree algorithms. Binary features are extracted from executable files using n-grams and feature reduction is applied. The reduced features are used to generate ARFF files for training a decision tree classifier. The decision tree is able to classify unknown files as spyware or benign based on their n-gram patterns. The proposed method aims to detect both known and new, unseen spyware files unlike signature-based detection methods. A prototype application is developed with a graphical user interface to scan for and detect spyware files on a system.
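The n-gram feature extraction described in this summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `byte_ngrams` and `binary_features`, the default window size, and the fixed vocabulary are all assumptions made for the example, and the feature reduction and ARFF export steps are not shown.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every overlapping n-byte window in data (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A file shorter than n bytes yields an empty counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def binary_features(data: bytes, vocabulary: list[bytes]) -> list[int]:
    """Return 1/0 presence flags, one per n-gram in a fixed vocabulary.

    All vocabulary entries are assumed to have the same length.
    """
    n = len(vocabulary[0])
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Example: the vector for a file is the presence of each vocabulary n-gram.
vector = binary_features(b"abcab", [b"ab", b"zz"])
```

A vector of this kind, one per executable, is the sort of input a decision tree learner could be trained on.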
Given at TRISC 2010, Grapevine, Texas.
http://www.trisc.org/speakers/aditya_sood/#p
The talk sheds light on new trends in web-based malware. Technology and insecurity go hand in hand. With the advent of new attacks and techniques, the distribution of malware through the web has increased tremendously. Browser-based exploits, mainly targeting Internet Explorer, have given birth to a new world of malware infection. Attackers spread malware by exploiting vulnerabilities and through drive-by downloads, using infection strategies such as malware distribution through IFRAME injections and search engine optimization. To understand the intrinsic behavior of web-based malware, an analysis of the logic behind it is required. It is necessary to dissect this malware from bottom to top in order to control its devastating behavior. The talk will cover structured methodologies and demonstrate static, dynamic and behavioral analysis of web malware, including PCAP analytics. Demonstrations will show the necessity of web malware analysis.
The document discusses automatic deobfuscation of binary code. It presents a local semantic analysis approach that rewrites binary code in a simpler form without relying on manual identification of obfuscation patterns. The approach uses compiler optimization techniques like constant propagation, folding, and stack optimization on virtual machine handler functions to drastically simplify the obfuscated code. It is able to reduce handler functions from 100-200 instructions to at most 10 instructions within a single basic block.
Bird binary interpretation using runtime disassemblyUltraUploader
The document describes BIRD (Binary Interpretation using Runtime Disassembly), a binary analysis and instrumentation infrastructure for the Windows/x86 platform. BIRD combines static and dynamic disassembly to guarantee that every instruction in a binary is analyzed before execution. It provides services to convert binary code to assembly and insert instrumentation code without affecting program semantics. The prototype took 12 student months to develop and can successfully analyze applications like Microsoft Office, Internet Explorer, and IIS with low overhead of below 4%.
Accurately detecting source code of attacks that increase privilegeUltraUploader
The document discusses developing a system to detect source code for attacks that increase privilege before they are executed. The system separates incoming data into categories like C code or shell code. Features are extracted from each sample and used to estimate if it contains attack code. The system has been evaluated on large databases of normal and attack software written by many authors, with results showing accurate detection of attack code.
This document summarizes the authors' experience over two years fuzzing VoIP devices to discover vulnerabilities. They used their in-house tool KIF to conduct stateful protocol fuzzing on a variety of VoIP equipment. The testing uncovered many vulnerabilities related to weak input validation, including buffer overflows and format string issues. Some vulnerabilities allowed compromising internal networks by exploiting unfiltered web interfaces on VoIP phones. The authors disclosed vulnerabilities responsibly and provided mitigation techniques.
Approaching zero the extraordinary underworld of hackers, phreakers, virus ...UltraUploader
This document provides a summary of the prologue of the book "Approaching Zero" which details the story of a teenage computer hacker known as "Fry Guy". The summary is as follows:
Fry Guy is able to hack into one of the most secure computer systems in the US containing credit histories by impersonating an employee over the phone. He then uses the account information to access the system from his home computer. The prologue provides background on how Fry Guy became interested in computers and hacking from a young age and would spend hours exploring phone and computer systems on his own.
Agisa towards automatic generation of infection signaturesUltraUploader
This document describes AGIS, a system for automatically generating infection signatures to detect compromised systems. AGIS monitors the runtime behavior of suspicious code to detect infections based on security policy violations. It then uses dynamic and static analysis to identify the characteristic behaviors of the malware in terms of system/API calls. Important instructions related to the infection's malicious behavior are extracted from executables to generate signatures that can be used by scanners to detect the infection. AGIS was implemented on Windows and evaluated against real malware samples, demonstrating its ability to detect new infections and generate high-quality signatures.
This document summarizes a research paper about binary obfuscation techniques that aim to make reverse engineering of software more difficult. The paper proposes replacing control transfer instructions like jumps and calls with signals (traps) that are handled by signal handling code to perform the control transfer. It also inserts dummy control transfers and junk instructions after traps to confuse disassemblers. Experimental results show this obfuscation causes disassemblers to miss 30-80% of instructions and make mistakes on over half of control flow edges, while increasing execution time.
Antivirus software testing for the new milleniumUltraUploader
This document discusses the need for standardized testing of antivirus software to properly evaluate claims by vendors of providing "faster, better, cheaper" protection. It outlines the current state of antivirus testing, including certification programs run by ICSA, Westcoast Labs, and universities. The tests evaluate detection of viruses in the wild and ability to disinfect. The document argues for a functional approach to testing that is not specific to any vendor or product.
A week is a long time in computer ethicsUltraUploader
Over the course of a week, numerous news articles reported on various ethical issues arising from the use of technology. Issues discussed included intellectual property theft through peer-to-peer file sharing and concerns over globalization. Other topics included the growing problems of identity theft, computer viruses and hacking, junk mail and spam, censorship and surveillance. The week demonstrated the many challenges posed by technology and how each issue often has countervailing perspectives around issues like freedom of expression, prevention versus cure of computer misuse, and information access versus information control.
Viruses spread by infecting executable programs, which then infect other programs when they are run. As infected programs are executed by different users who have authority over other programs and files, the virus can propagate throughout the system. Standard protection mechanisms in time-sharing systems are not sufficient to prevent the spread of viruses in this manner.
Are the current computer crime laws sufficient or should the writing of virus...UltraUploader
This document discusses whether current computer crime laws are sufficient to address the writing of virus code or if this activity should be prohibited. It begins with background on cybercrime and viruses, defining viruses, worms, and payloads. It describes how malware is released and the threats posed by viruses and worms. It outlines current US federal and state computer crime laws and their limitations in addressing virus writing. The document argues that directly prohibiting virus writing may be needed and examines how a new statute could address this and potential issues it may raise regarding free speech.
The document provides a detailed chronology and analysis of the Morris worm, one of the earliest computer worms to spread via the Internet. It summarizes that on November 2, 1988, a self-replicating program was released that infected hundreds or thousands of computers running UNIX via vulnerabilities in sendmail, finger, and rsh/rexec. It then analyzes the worm's code to describe how it spread, hid itself, and avoided detection by system administrators as it rapidly propagated across the Internet.
Anomalous payload based network intrusion detectionUltraUploader
This document summarizes a payload-based anomaly detection system called PAYL that models normal application payloads for network traffic. PAYL computes a byte frequency distribution profile during training and uses Mahalanobis distance during detection to measure similarity to the profile. It was shown to achieve nearly 100% accuracy with a 0.1% false positive rate on some datasets. The system aims to detect new worms and exploits at a network gateway before propagation.
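The two steps named in this summary, a byte frequency profile built during training and a distance to that profile during detection, can be sketched as below. This is an illustration under stated assumptions: it uses a simplified per-byte form of the Mahalanobis distance (absolute deviation divided by standard deviation plus a smoothing constant) rather than the full covariance-based distance, and the function names, the `alpha` value, and the sample payloads are hypothetical.

```python
import statistics


def byte_frequencies(payload: bytes) -> list[float]:
    """Relative frequency of each of the 256 byte values in a payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload) or 1  # avoid division by zero on empty payloads
    return [c / total for c in counts]


def train_profile(payloads: list[bytes]) -> tuple[list[float], list[float]]:
    """Per-byte mean and standard deviation over the training payloads."""
    columns = list(zip(*(byte_frequencies(p) for p in payloads)))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return means, stdevs


def simplified_distance(payload: bytes, means: list[float],
                        stdevs: list[float], alpha: float = 0.001) -> float:
    """Sum over bytes of |f - mean| / (stdev + alpha); alpha is an assumed smoothing term."""
    freqs = byte_frequencies(payload)
    return sum(abs(f - m) / (s + alpha) for f, m, s in zip(freqs, means, stdevs))


# Hypothetical training traffic and two payloads to score.
means, stdevs = train_profile([b"GET /index.html", b"GET /about.html", b"GET /home.html"])
normal_score = simplified_distance(b"GET /index.html", means, stdevs)
odd_score = simplified_distance(bytes([0x90]) * 20, means, stdevs)
```

A payload whose score exceeds a chosen threshold would be flagged as anomalous; how that threshold is set is not covered by the summary.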
A survey of cryptologic issues in computer virologyUltraUploader
The document discusses how cryptographic techniques can be used maliciously in computer virology to evade antivirus detection. It covers how encryption can be used to randomly generate IP addresses for worm propagation, polymorphically mutate viral code, and armor code to prevent analysis. As an example, it analyzes how a programming error in the random number generator of the Sapphire/Slammer worm led to poor randomness and biased propagation.
The document describes research on automatically generating multiple neural network classifiers to detect unknown Win32 viruses through heuristics. Individual classifiers have too high a false positive rate, but combining outputs through voting reduces false positives to arbitrarily low levels with a slight increase in false negatives. The researchers constructed neural network classifiers based on n-gram features from virus samples, combining outputs through voting to achieve low false positive rates suitable for real-world use.
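The effect of voting on false positives can be illustrated with a small sketch. The `vote` function and the threshold are hypothetical, and the probability calculation assumes the classifiers err independently, which is a simplifying assumption of this example and not a claim made in the summary.

```python
from math import comb


def vote(predictions: list[bool], threshold: int) -> bool:
    """Flag a file as a virus only if at least `threshold` classifiers agree."""
    return sum(predictions) >= threshold


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent events, each with probability p, occur."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# With five classifiers that each misfire on 10% of clean files (assumed figure),
# requiring all five to agree makes a combined false positive far less likely.
combined_fp = prob_at_least(5, 5, 0.10)
```

The same formula shows the trade-off mentioned above: raising the threshold also raises the chance that a real virus fails to collect enough votes.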
Applications of genetic algorithms to malware detection and creationUltraUploader
This document summarizes and analyzes previous research on applying genetic algorithms to malware detection and creation. Section 2 summarizes a paper that compared the performance of genetic algorithm-based classifiers to non-genetic classifiers for detecting malware. It found genetic algorithms performed comparably to other methods in classification accuracy but with lower processing overhead. Sections 3 and 4 summarize papers applying genetic algorithms to optimize parameters for real-time malware detection and to evolve malware signatures similar to antibodies. Section 5 discusses using genetic algorithms to evolve malware. The document analyzes the effectiveness of genetic algorithms for malware detection tasks and issues around using them to evolve malware.
Anti malware tools intrusion detection systemsUltraUploader
This document provides an overview of using intrusion detection systems (IDS), specifically Snort, to detect malware. It discusses getting and installing Snort on Windows and Linux, and recommended additions like MySQL, BASE, and ACID. The document outlines what the finished installation interface might look like when using BASE or ACID. It then discusses how to create effective malware signatures/rules for Snort by examining malware samples and extracting detection strings from their code or encoded attachments. Creating rules for multi-vector malware may require multiple signatures to detect it in different propagation states.
A sense of 'danger' for windows processesUltraUploader
This document summarizes research on using Dendritic Cell Algorithms (DCA) for malware detection. The researchers collected API call traces from real malware and benign Windows processes to evaluate the accuracy of classical DCA (cDCA) and deterministic DCA (dDCA) for classifying processes as malware or benign. They also studied the effects of antigen multiplier and time-windows on the detection accuracy of the algorithms.
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSIJNSA Journal
Malware writers have employed various obfuscation and polymorphism techniques to thwart static analysis approaches and bypass antivirus tools. Dynamic analysis techniques, however, have essentially overcome these deceits by observing the actual behaviour of the code execution. In this regard, various methods, techniques and tools have been proposed. However, because of the diverse concepts and strategies used in the implementation of these methods and tools, security researchers and malware analysts find it difficult to select the optimum tool to investigate the behaviour of malware and to contain the associated risk for their study. Focusing on two dynamic analysis techniques, Function Call monitoring and Information Flow Tracking, this paper presents a comparison framework for dynamic malware analysis tools. The framework will assist researchers and analysts in recognizing a tool's implementation strategy, analysis approach, system-wide analysis support and overall handling of binaries, helping them to select a suitable and effective one for their study and analysis.
Optimised Malware Detection in Digital Forensics IJNSA Journal
This summarizes a research paper that proposes developing a new framework to optimize malware detection in digital forensics investigations. The paper discusses challenges with existing detection methods, such as signature-based approaches requiring extensive manual analysis. Through a market research survey of forensics professionals, the paper finds weaknesses in current skills, tools, and accuracy rates. Most respondents agreed a new customized detection tool is needed that employs both dynamic and static analysis methods. The proposed framework aims to address these issues to more effectively detect and analyze malware.
Abstract: The exponential growth of the internet and new technology has put today's world in a hectic situation, with both positive and negative aspects. Cybercriminals operate on the dark net using numerous techniques, which leads to cybercrime. Cyber threats such as malware attempt to infiltrate a computer or mobile device, whether offline or through the internet or online chat, and anyone can be a potential target. Malware, also known as malicious software, is often used by cybercriminals to achieve their goals by tracking internet activity, capturing sensitive information, or blocking computer access. Reverse engineering is one of the best means of prevention and a powerful tool in the fight against cyber attacks. Most people in the cyber world see it as a black hat, said to be used to steal data and intellectual property. But in the hands of cybersecurity experts, reverse engineering dons the white hat of the hero. It looks at a program from the outside in, often by a third party that had no hand in writing the code, and allows those who practice it to understand how a given program or system works when no source code is available. Reverse engineering accomplishes several tasks related to cybersecurity: finding system vulnerabilities, researching malware, and analyzing the complexity of restoring core software algorithms, which can further protect against theft. It is hard to hack certain software.
Keywords: Malware, threat, vulnerability, detection, reverse engineering, analysis.
Title: Malware analysis and detection using reverse Engineering
Author: B.Rashmitha, J. Alwina Beauty Angelin, E.R. Ramesh
International Journal of Computer Science and Information Technology Research
ISSN 2348-1196 (print), ISSN 2348-120X (online)
Vol. 10, Issue 2, Month: April 2022 - June 2022
Page: (1-4)
Published Date: 01-April-2022
Research Publish Journals
Available at: www.researchpublish.com
The full research paper can be downloaded directly at the link below:
https://www.researchpublish.com/papers/malware-analysis-and-detection-using-reverse-engineering
Academia Link: https://www.academia.edu/76069664/Malware_analysis_and_detection_using_reverse_Engineering_Available_at_www_researchpublish_com_journal_name_International_Journal_of_Computer_Science_and_Information_Technology_Research
Vulnerability scanners a proactive approach to assess web application securityijcsa
With the increasing concern for security in the network, many approaches have been laid out that try to protect
the network from unauthorised access. New methods have been adopted in order to find the potential
discrepancies that may damage the network. The most commonly used approach is vulnerability
assessment. By vulnerability, we mean the potential flaws in the system that make it prone to attack.
Assessment of these system vulnerabilities provides a means to identify and develop new strategies so as to
protect the system from the risk of being damaged. This paper focuses on the usage of various vulnerability
scanners and their related methodology to detect the various vulnerabilities present in web
applications or the remote host across the network and tries to identify new mechanisms that can be
deployed to secure the network.
Malware Detection Using Data Mining Techniques Akash Karwande
This document discusses techniques for malware detection using data mining. It begins by defining the problem of malware as one of the most serious issues faced on the internet. It then discusses types of malware like viruses, worms, trojans, and rootkits. It describes how rootkits can hide themselves and their activities. The document outlines static and dynamic analysis methods for malware detection and describes signature-based and behavior-based detection techniques. It shows results from using the Weka tool achieving over 97% success in rootkit detection. Advanced techniques discussed include n-grams and analyzing API/system calls.
Problems With Battling Malware Have Been Discussed, Moving...Deb Birch
This document discusses several new methods for detecting malware, including CPU analyzers, holography, eigenvirus detection, differential fault analysis, and whitelist protection. It notes that due to a focus on deobfuscation, these ideas have only recently been explored and are still underdeveloped. Specific methods like CPU analyzers and holography are examined in more detail.
Unveiling the Shadows: A Comprehensive Guide to Malware Analysis for Ensuring...cyberprosocial
Malicious software, or malware, is a constant concern in the networked world of digital landscapes. Cybercriminals are always improving their strategies, which makes malware more complex and difficult to identify. To combat this, protecting computer systems requires an understanding of and application of malware analysis.
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...IJCI JOURNAL
With a text mining and bibliometrics approach, this study reviews the literature on the evolution
of malware classification using machine learning. This work takes literature from 2008 to 2022
on the subject of using machine learning for malware classification to understand the impact of
this technology on malware classification. Throughout this study, we seek to answer three main
research questions: RQ1: Is the application of machine learning for malware classification
growing? RQ2: What is the most common machine-learning application for malware
classification? RQ3: What are the outcomes of the most common machine learning
applications? The analysis of 2186 articles resulting from a data collection process from peer-reviewed databases shows the trajectory of the application of this technology on malware
classification as well as trends in both the machine learning and malware classification fields of
study. This study performs quantitative and qualitative analysis using statistical and N-gram
analysis techniques and a formal literature review to answer the proposed research questions.
The research reveals methods such as support vector machines and random forests to be
standard machine learning methods for malware classification in efforts to detect maliciousness
or categorize malware by family. Machine learning is a highly researched technology with
many applications, from malware classification and beyond.
A trust system based on multi level virus detectionUltraUploader
This document summarizes a research paper that proposes a new multi-level virus detection system (MDS). The MDS uses three levels of protection: 1) A smart memory monitor that detects virus behavior in real-time, 2) A file checker that analyzes batch files for virus-like code, and 3) An integrity checker that stores file signatures to detect modifications where viruses typically infect. The system was tested and able to detect virus activity through monitoring, file analysis, and integrity checking at different levels simultaneously. The paper concludes the MDS approach provides improved virus detection over single-method systems.
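The third level, an integrity checker that stores file signatures and later detects modifications, can be sketched as follows. The summary does not say which signature scheme the MDS uses; SHA-256 from Python's standard `hashlib` module is an assumption of this example, as are the function names `file_digest`, `snapshot`, and `modified_files`.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 digest of a file, read in chunks so large files are handled."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: list[str]) -> dict[str, str]:
    """Record a baseline signature for each monitored file."""
    return {str(p): file_digest(Path(p)) for p in paths}


def modified_files(baseline: dict[str, str]) -> list[str]:
    """Return monitored files that are missing or whose signature has changed."""
    changed = []
    for name, expected in baseline.items():
        path = Path(name)
        if not path.exists() or file_digest(path) != expected:
            changed.append(name)
    return changed
```

In use, `snapshot` would be run once on files known to be clean, and `modified_files` run periodically against the stored baseline.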
IRJET- Zombie - Venomous File: Analysis using Legitimate Signature for Securi...IRJET Journal
The document discusses a proposed method for detecting viruses and malware that evade existing antivirus software. It uses a combination of analyzing files with VirusTotal's database of known threats and applying natural language processing techniques like suffix trees and TF-IDF to identify malicious patterns in files. An evaluation shows the proposed method can detect viruses that existing antivirus and VirusTotal miss, achieving a 97% accuracy rate in testing.
1. The document describes RAVE, a Replicated AntiVirus Engine system for email infrastructures that uses fault tolerance concepts. It runs multiple antivirus engines in parallel to improve detection capabilities.
2. By having several replicas running different antivirus engines (and operating systems), the system offers very high detection efficiency with no downtime during updates and can tolerate arbitrary failures of a predefined number of replicas.
3. The system aims to address issues with existing antivirus solutions like increased complexity leading to more vulnerabilities, and improve over solutions running multiple engines by providing fault tolerance against failures or attacks.
Acquisition of malicious code using active learningUltraUploader
This document presents a methodology for detecting unknown malicious code using an active learning framework. It discusses how machine learning algorithms have been used successfully to detect malicious code based on n-gram representations of binary files. The authors propose using active learning to efficiently acquire unknown malicious files from a stream of executable files. They evaluate their approach on a test collection of over 30,000 files and show that active learning improves the accuracy of the classifier and the efficiency of acquiring new malicious files.
Malware Risk Analysis on the Campus Network with Bayesian Belief NetworkIJNSA Journal
A security network management system provides clear guidelines on risk evaluation and assessment for enterprise networks. The threat and risk assessment is conducted to safeguard enterprise network services and to maintain system confidentiality, integrity, and availability through effective control strategies. In this paper, based on our previous work in analyzing integrated information security management and malware propagation on the campus network through mathematical modelling, we propose a Bayesian Belief Network with an inference level indicator to enable the decision maker to understand the risks posed and make appropriate mitigation decisions. We experimentally placed monitoring sensors on the campus network that give the threat alert priority levels and magnitude for the vulnerable information assets. These methods indicate the belief inferred from malware prevalence on the information security assets, for better understanding.
AN ISP BASED NOTIFICATION AND DETECTION SYSTEM TO MAXIMIZE EFFICIENCY OF CLIE...IJNSA Journal
End users are increasingly vulnerable to attacks directed at web browsers, which exploit the popularity of today's web services. While organizations deploy several layers of security to protect their systems and data against unauthorised access, surveys reveal that a large fraction of end users do not use, or are not familiar with, any security tools. End users' hesitation and unfamiliarity with security products contribute vastly to the number of online DDoS attacks and to malware and spam distribution. This work-in-progress paper proposes a design focused on increased participation of internet service providers in protecting end users. The proposed design takes advantage of three different detection tools to identify the maliciousness of website content and alerts users through an in-browser, cross-platform messaging system that uses the Internet Content Adaptation Protocol (ICAP). The system also incorporates analysis of users' online behaviour to minimize the intervals at which client honeypots scan the database of malicious websites. Findings from our proof-of-concept design and other research indicate that such a design can provide a reliable hybrid detection mechanism while introducing low delay into the user's browsing experience.
Malware is a worldwide pandemic. It is designed to damage computer systems without
the knowledge of the owner using the system. Software from reputable vendors also contains
malicious code that affects the system or leaks information to remote servers. Malware includes
computer viruses, spyware, dishonest adware, rootkits, Trojans, dialers etc. Malware detectors are
the primary tools in defense against malware. The quality of such a detector is determined by the
techniques it uses. It is therefore imperative that we study malware detection techniques and
understand their strengths and limitations. This survey examines different types of malware and
malware detection methods.
This document describes a proposed vulnerability management system (VMS) that aims to automate the process of scanning software applications to identify vulnerabilities. The proposed system uses a hybrid algorithm approach that incorporates features from existing vulnerability detection tools and algorithms. The algorithm involves five main phases: inspection, scanning, attack detection, analysis, and reporting. The algorithm is intended to increase the accuracy of vulnerability detection compared to existing systems. The proposed VMS system and hybrid algorithm were tested using various vulnerability scanning tools on virtual machines, and results demonstrated that the VMS could automate the vulnerability assessment process and generate reports on detected vulnerabilities with severity levels. The main limitation is that scans using the VMS may take more time than some existing tools.
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSijaia
This document presents a static malware detection system using data mining techniques. The system extracts raw features from Windows Portable Executable (PE) files including PE header information, DLLs, and API functions. It then selects important features using Information Gain and reduces dimensions using Principal Component Analysis. Three classifiers (SVM, J48, Naive Bayes) are trained on the transformed feature vectors to classify files as malicious or benign. When evaluated on a dataset of over 247,000 files, the system achieved a detection rate of 99.6%.
An analysis of how antivirus methodologies are utilized in protecting compute...UltraUploader
This document discusses different methodologies used by antivirus software to detect and protect against malicious code. It describes three main categories of antivirus scanning: signature detection through file scanning, heuristics scanning, and general decryption scanning. Signature detection involves comparing files to known virus signatures in a database. Heuristics scanning evaluates patterns of behavior to detect abnormal application activity. General decryption scanning is used to detect encrypted or polymorphic viruses.
Malwise-Malware Classification and Variant ExtractionIOSR Journals
This document summarizes a research paper on classifying and extracting variants of malware. It discusses using both dynamic and static analysis to classify malware, including using entropy analysis to detect unpacking of packed malware. It proposes using control flow graphs and matching algorithms to perform malware classification and generate signatures to detect variants. The paper presents the methodology used, including generating signature trees and feature extraction. It evaluates classification algorithms like Naive Bayes, J48, and Random Forest on real and synthetic malware datasets. The conclusion is that the approaches can effectively identify malware variants and new malware is often a variant of existing malware.
Today’s threats have become very complex and serious in their packing and encryption techniques. Every day, new malware variants increase in both quantity and quality by using packing and encryption techniques. The challenge in this research field is that traditional malware detection systems sometimes fail to detect new malware variants and produce false alarms. Malicious software in the form of viruses, worms, trojans, ransomware, and spyware harms our computer systems, network environments, and organizations in various ways. Therefore, malware analysis for detection and family classification plays a significant role in Cyber Crime Incident Handling Systems. This system contributes malware family classification with 10 prominent features obtained through a feature selection process. The approach also contributes a process for labeling malicious samples using regular expressions. The proposed malware classification system covers 7 different families, including malware and benign, using machine learning classifiers. The findings from our experiment show that the selected 10 API features provide the best evaluation metrics in terms of accuracy, precision-recall, and ROC scores.
This document is the manual for PHP, the PHP Documentation Group's copyright from 1997 to 2002. It contains information about installing and configuring PHP on various operating systems like Unix, Linux, Windows, etc. It also covers PHP syntax, functions, classes, and other features. The manual is distributed under the GNU General Public License and parts of it are also distributed under the Open Publication License. It was translated into Italian with contributions from multiple people.
Broadband network virus detection system based on bypass monitorUltraUploader
The document describes a Broadband Network Virus Detection System (VDS) based on bypass monitoring that can detect viruses on high-speed networks. The VDS uses four detection engines to analyze network traffic for viruses based on binary content, URLs, emails, and scripts. It accurately logs statistical information on detected viruses like name, source/target IPs, and spread frequency. The VDS mirrors network traffic to a detection engine in real-time without needing to reassemble packets into files. This allows it to efficiently detect viruses directly in network packets or data streams on gigabit-speed networks.
This document discusses botnets and their applications. It begins with an overview of botnets, how they are controlled through command and control servers, and how rootkits can help conceal botnet activity. It then explores how botnets can be used for spam, phishing, click fraud, identity theft, and distributed denial-of-service attacks. Detection and mitigation techniques are also summarized, including network intrusion detection, honeynets, DNS monitoring, and modeling botnet propagation across timezones. Recent botnets like AgoBot, PhatBot, and Bobax are also examined in the context of spam distribution. Open research questions around botnet membership detection, click fraud detection, and phishing detection are presented.
Bot software spreads, causes new worriesUltraUploader
Bot software infects millions of computers worldwide without the owners' knowledge and turns them into zombies that perform malicious tasks as part of a bot network. These bot networks, which can include thousands of infected computers, are used to spread viruses and worms, send spam emails, install spyware, and launch denial-of-service attacks. While initially just an automated way to spread malware, bot networks are now also used for criminal activities like identity theft due to their ability to stealthily command a large number of compromised computers. Security experts warn that the proliferation of bot networks poses serious risks and is very difficult to stop given their automation and scale.
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...UltraUploader
The document discusses blended threats that combine exploits and vulnerabilities with computer viruses. It begins with definitions of blended attacks and buffer overflows. It then describes three generations of buffer overflow techniques as well as other vulnerabilities exploited by blended threats, such as URL encoding and MIME header parsing. The document also discusses past threats like the Morris worm and CodeRed that blended exploits with viruses, and techniques used to combat future blended threats through defense in depth.
Win32/Blaster was a worm that exploited a vulnerability in Windows RPC to infect systems running Windows 2000 and Windows XP. It installed itself to automatically run on startup and then attempted to infect other systems on the local network and randomly selected IP addresses. The infection process involved exploiting the RPC vulnerability to execute a remote shell, downloading the worm binary, and executing it. It also launched a SYN flooding DDoS attack against Windows Update sites each month after the 16th. The worm spread quickly after the vulnerability was disclosed and highlighted the increasing automation and harm of worms.
Biologically inspired defenses against computer virusesUltraUploader
This document discusses two biologically inspired approaches to computer virus detection and removal: a neural network virus detector that learns to identify infected and uninfected programs, and a computer immune system that can automatically identify, analyze, and remove new viruses from a system. The neural network technique has been incorporated into an IBM commercial antivirus product, while the computer immune system is still in prototype form. Both aim to replace human analysis of viruses to allow faster response times needed to address increasing rates of new virus creation and spread.
1. The document discusses biological viruses and computer viruses, providing background on how biological viruses work by hijacking cellular mechanisms of DNA replication, transcription and translation. It defines a computer virus as a piece of code with self-replicating ability that relies on other programs to exist, similar to biological viruses. 2. Computer viruses can cause damage by infecting programs which then infect other programs, potentially spreading like an epidemic across connected computers. 3. The document argues that a better understanding of biological and computer mechanisms can help improve defenses against viruses.
Biological aspects of computer virologyUltraUploader
This document discusses biological aspects of computer viruses and how factors that influence the spread of biological pathogens can also affect the propagation of computer malware. It analyzes three major factors that influence the spread of a computer worm: the infection propagator, which examines characteristics of exploited vulnerabilities like prevalence and age; the target locator, which focuses on how worms find new targets; and the worm's virulence, which looks at aspects that increase its infectiousness. The document suggests studying computer virus propagation through the lens of epidemiology models used for infectious diseases.
Biological models of security for virus propagation in computer networksUltraUploader
This document discusses how biological models of disease propagation and defense mechanisms in living organisms can inspire new approaches to computer network security and virus detection. Specifically, it describes how genetic regulatory networks that turn off harmful genes, protein interaction networks that model cellular processes, and epidemiological models of disease spread can provide models for automatically detecting and containing computer viruses without relying solely on pre-defined virus signatures. The authors propose several new security models drawing on these biological analogies, such as using surrogate code to maintain system functionality when parts are shut off, modeling network interactions to determine how viruses propagate, and evolving network services in real-time to reconstitute functionality after attacks.
Beyond layers and peripheral antivirus securityUltraUploader
This white paper from Trend Micro discusses strategies for effective antivirus security beyond just protecting desktops. It argues that while desktop protection is still important, viruses often spread faster than antivirus updates can be deployed to endpoints. It therefore recommends taking additional measures across the network like stopping viruses at email/file servers, firewalls, and through education. The paper provides an overview of virus impacts and outlines Trend Micro's solutions that can block new threats before pattern updates and help repair damage.
Automated Classification and Analysis of Internet Malware
Michael Bailey,* Jon Oberheide,* Jon Andersen,* Z. Morley Mao,* Farnam Jahanian,*† Jose Nazario†
* Electrical Engineering and Computer Science Department, University of Michigan
{mibailey, jonojono, janderse, zmao, farnam}@umich.edu
† Arbor Networks
{farnam, jose}@arbor.net
April 26, 2007
Abstract
Numerous attacks, such as worms, phishing, and botnets, threaten the availability of the Internet,
the integrity of its hosts, and the privacy of its users. A core element of defense against these attacks is
anti-virus (AV), a service that detects, removes, and characterizes these threats. The ability of these products
to successfully characterize these threats has far-reaching effects, from facilitating sharing across
organizations, to detecting the emergence of new threats, and assessing risk in quarantine and cleanup.
In this paper, we examine the ability of existing host-based anti-virus products to provide semantically
meaningful information about the malicious software and tools (or malware) used by attackers. Using a
large, recent collection of malware that spans a variety of attack vectors (e.g., spyware, worms, spam), we
show that different AV products characterize malware in ways that are inconsistent across AV products,
incomplete across malware, and that fail to be concise in their semantics. To address these limitations, we
propose a new classification technique that describes malware behavior in terms of system state changes
(e.g., files written, processes created) rather than in sequences or patterns of system calls. To address
the sheer volume of malware and diversity of its behavior, we provide a method for automatically categorizing
these profiles of malware into groups that reflect similar classes of behaviors and demonstrate how
behavior-based clustering provides a more direct and effective way of classifying and analyzing Internet
malware.
1 Introduction
Many of the most visible and serious problems facing the Internet today depend on a vast ecosystem of
malicious software and tools. Spam, phishing, denial of service attacks, botnets, and worms largely depend on
some form of malicious code, commonly referred to as malware. Malware is often used to infect the computers
of unsuspecting victims by exploiting software vulnerabilities or tricking users into running malicious code.
Understanding this process and how attackers use the backdoors, key loggers, password stealers and other
malware functions is becoming an increasingly difficult and important problem.
Unfortunately, the complexity of modern malware is making this problem more difficult. For example,
Agobot [3] has been observed to have more than 580 variants since its initial release in 2002. Modern Agobot
variants have the ability to perform denial of service attacks, steal bank passwords and account details,
propagate over the network using a diverse set of remote exploits, use polymorphism to evade detection
and disassembly, and even patch vulnerabilities and remove competing malware from an infected system [3].
Making the problem even more challenging is the increase in the number and diversity of Internet malware.
A recent Microsoft survey found more than 43,000 new variants of backdoor trojans and bots during the
first half of 2006 [22]. Automated and robust approaches to understanding malware are required in order to
successfully stem the tide.
Dataset   Date Collected              Number of      Number of Unique Labels
Name                                  Unique MD5s    McAfee   F-Prot   ClamAV   Trend   Symantec
legacy    01 Jan 2004 - 31 Dec 2004   3,637          116      1,216    590      416     57
small     03 Sep 2006 - 22 Oct 2006   893            112      379      253      246     90
large     03 Sep 2006 - 18 Mar 2007   3,698          310      1,544    1,102    2,035   50
Table 1: The datasets used in this paper: a large collection of legacy binaries from 2004, a small 6-week
collection from 2006, and a large 6-month collection of malware from 2006/2007. The number of unique
labels provided by 5 AV systems is listed for each dataset.
Previous efforts to automatically classify and analyze malware (e.g., AV, IDS) focused primarily on
content-based signatures. Unfortunately, content-based signatures are inherently susceptible to inaccuracies
due to polymorphic and metamorphic techniques. In addition, the signatures used by these systems often
focus on a specific exploit behavior, an approach increasingly complicated by the emergence of multi-vector
attacks. As a result, IDS and AV products characterize malware in ways that are inconsistent across products,
incomplete across malware, and that fail to be concise in their semantics. This creates an environment in
which defenders are limited in their ability to share intelligence across organizations, to detect the emergence
of new threats, and to assess risk in quarantine and cleanup of infections.
To address the limitations of existing automated classification and analysis tools, we have developed
and evaluated a dynamic analysis approach based on the execution of malware in virtualized environments
and the causal tracing of the operating system objects created as a result of the malware’s execution. The
reduced collection of these user visible system state changes (e.g., files written, processes created) is used
to create a fingerprint of the malware’s behavior. These fingerprints are more invariant and directly useful
than abstract code sequences representing programmatic behavior and can be directly used in assessing
the potential damage incurred, enabling detection and classification of new threats, and assisting in the risk
assessment of these threats in mitigation and cleanup. To address the sheer volume of malware and diversity
of its behavior, we provide a method for automatically categorizing these profiles of malware into groups
that reflect similar classes of behaviors. These methods are thoroughly evaluated in the context of a malware
dataset that is large, recent, and diverse in the set of attack vectors it represents (e.g., spam, worms, bots,
spyware).
This paper is organized as follows: Section 2 describes the shortcomings of existing AV software and
enumerates requirements for effective malware classification. We present our behavior-based fingerprint
extraction and fingerprint clustering algorithm in Section 3. Our detailed evaluation is shown in Section 4.
We present existing work in Section 5, offer limitations and future directions in Section 6, and conclude in
Section 7.
2 Anti-virus clustering of malware
Host-based AV systems detect and remove malicious threats from end systems. As a normal part of this
process these AV programs provide a description for the malware they detected. The ability of these products
to successfully characterize these threats has far-reaching effects, from facilitating sharing across organizations,
to detecting the emergence of new threats, and assessing risk in quarantine and cleanup. However,
for this information to be effective, the descriptions provided by these systems must be meaningful. In this
section, we evaluate the ability of host-based AV to provide meaningful intelligence on Internet malware.
2.1 Understanding anti-virus malware labeling
In order to accurately characterize the ability of AV to provide meaningful labels for malware, we first need
to acquire representative datasets. In this paper, we use three datasets from two sources as shown in Table 1.
One dataset, legacy, is taken from a network security community malware collection and consists of randomly
sampled binaries from those posted to the community’s FTP server in 2004. In addition, we use a large,
recent 6-month collection of malware and a 6-week subset of that collection at the beginning of the dataset
collection period. The small and large datasets are a part of the Arbor Malware Library (AML). Created
by Arbor Networks, Inc. [24], the AML consists of binaries collected by a variety of techniques including Web
page crawling [32], spam traps [28], and honeypot-based vulnerability emulation [2]. Since each of these
methods collects binaries that are installed on the target system without the user’s permission, the binaries
collected are highly likely to be malicious. Almost 3,700 unique binaries were collected over a 6-month period
in late 2006 and early 2007.
Label      Software                           Vendor                            Version      Signature File
McAfee     Virus Scan                         McAfee, Inc.                      v4900        20 Nov 2006
                                                                                v5100        31 Mar 2007
F-Prot     F-Prot Anti-virus                  FRISK Software International      4.6.6        20 Nov 2006
                                                                                6.0.6.3      31 Mar 2007
ClamAV     Clam Anti-virus                    Tomasz Kojm and the ClamAV Team   0.88.6       20 Nov 2006
                                                                                0.90.1       31 Mar 2007
Trend      PC-cillin Internet Security 2007   Trend Micro, Inc.                 8.000-1001   20 Nov 2006
                                                                                8.32.1003    31 Mar 2007
Symantec   Norton Anti-virus 2007             Symantec Corporation              14.0.0.89    20 Nov 2006
                                                                                14.0.3.3     31 Mar 2007
Table 2: Anti-virus software, vendors, versions, and signature files used in this paper. The small and legacy
datasets were evaluated with a version of these systems in November of 2006 and both small and large were
evaluated again with a version of these systems in March of 2007.
After collecting the binaries, we analyzed them using the AV scanners shown in Table 2. Each of the
scanners was the most recent available from each vendor at the time of the analysis. The virus definitions
and engines were updated uniformly on November 20th, 2006, and then again on March 31st, 2007. Note
that the first update was over a year after the legacy collection ended and one month after the end of the
small set collection. The second update was 13 days after the end of the large set collection.
AV systems rarely use the exact same labels for a threat and users of these systems have come to
expect simple naming differences (e.g., W32/Lovsan.worm.a versus Lovsan versus WORM_MSBLAST.A)
across vendors. It has always been assumed, however, that there existed a simple mapping from one system’s
name space to another and recently investigators have begun creating projects to unify these name spaces [4].
Unfortunately, the task appears daunting. Consider, for example, the number of unique labels created by
various systems. The result in Table 1 is striking: there is a substantial difference in the number of unique
labels created by each AV system. While one might expect small differences, it is clear that AV vendors
disagree not only on what to label a piece of malware, but also on how many unique labels exist for malware
in general.
One simple explanation of these differences in the number of labels is that some of these AV systems
provide a finer level of detail into the threat landscape than the others. For example, the greater number
of unique labels in Table 1 for F-Prot may be the result of F-Prot’s ability to more effectively differentiate
small variations in a family of malware. To investigate this conjecture, we examined the labels of the legacy
dataset produced by the AV systems and, using a collection of simple heuristics for the labels, we created a
pool of malware classified by F-Prot, McAfee, and ClamAV as SDBot [21]. We then examined the percentage
of time each of the three AV systems classified these malware samples as part of the same family. The result
of this analysis can be seen in Figure 1. Each AV classifies a number of samples as SDBot yet the intersection
of these different SDBot families is not clean, since there are many samples that are classified as SDBot by
one AV and as something else by the others. It is clear that these differences go beyond simple differences
in labeling—anti-virus products assign distinct semantics to differing pieces of malware.
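The family-overlap analysis behind Figure 1 can be sketched in a few lines. This is a minimal illustration rather than the heuristics actually used in the paper: it assumes a single hypothetical rule (a label belongs to the SDBot family if it contains the string "sdbot"), and every sample identifier and label in the example is invented.

```python
import re

# Hypothetical heuristic (an assumption, not the paper's rule set):
# a label names an SDBot variant if it contains "sdbot" in any case.
SDBOT = re.compile(r"sdbot", re.IGNORECASE)

def is_sdbot(label):
    """Return True when an AV label looks like an SDBot variant."""
    return label is not None and bool(SDBOT.search(label))

def venn_regions(labels_by_av):
    """Count the samples falling in each region of the Venn diagram.

    labels_by_av maps an AV name to {md5: label or None}.
    The result maps the set of AVs that said "SDBot" to a sample count,
    restricted to samples flagged as SDBot by at least one AV.
    """
    avs = sorted(labels_by_av)
    md5s = set().union(*(labels_by_av[av] for av in avs))
    regions = {}
    for md5 in md5s:
        voters = frozenset(
            av for av in avs if is_sdbot(labels_by_av[av].get(md5))
        )
        if voters:
            regions[voters] = regions.get(voters, 0) + 1
    return regions

# Invented labels for three invented samples.
labels_by_av = {
    "McAfee": {"m1": "W32/Sdbot.worm.gen", "m2": "W32/Sdbot.worm.gen", "m3": "Generic BackDoor"},
    "F-Prot": {"m1": "W32/Sdbot.ABC", "m2": None, "m3": "W32/Sdbot.XYZ"},
    "ClamAV": {"m1": "Trojan.SdBot-123", "m2": "Trojan.Mybot-5", "m3": None},
}
regions = venn_regions(labels_by_av)
```

A clean family definition would put nearly all samples in the region shared by all three products; samples in the single-product regions are the ambiguous cases the text describes.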
2.2 Properties of a Labeling System
Our previous analysis has provided a great deal of evidence indicating that labeling across AV systems does
not operate in a way that is useful to researchers, operators, and end users. Before we evaluate these systems
any further, it is important to precisely define the properties an ideal labeling system should have. We have
identified three key design goals for such a labeling system:
• Consistency. Similar items must be assigned the same label.
• Completeness. A label should be generated for as many items as possible.
• Conciseness. The label should reflect a specific meaning, either embedded in the label itself or by
reference.
Figure 1: A Venn diagram of malware labeled as SDBot variants by three AV products in the legacy dataset.
The classification of SDBot is ambiguous.
legacy
           McAfee   F-Prot   ClamAV   Trend   Symantec
McAfee     100      13       27       39      59
F-Prot     50       100      96       41      61
ClamAV     62       57       100      34      68
Trend      67       18       25       100     55
Symantec   27       7        13       14      100
small
           McAfee   F-Prot   ClamAV   Trend   Symantec
McAfee     100      25       54       38      17
F-Prot     45       100      57       35      18
ClamAV     39       23       100      32      13
Trend      45       23       52       100     16
Symantec   42       25       46       33      100
Table 3: The percentage of time two binaries classified as the same by one AV (row) are classified the same by
other AV systems (column). Malware is inconsistently classified across AV vendors.
2.3 Limitations of anti-virus
Having identified consistency, completeness, and conciseness as the design goals of a labeling system, we are
now prepared to investigate the ability of AV systems to meet these goals.
2.3.1 Consistency
In order to investigate consistency, we grouped malware into categories based on the labels provided by one of
the AV vendors. For each pair of distinct malware samples labeled as the same by a particular system, we computed the
percentage of time the same pair was classified by each of the other AV systems as the same. For example,
two binaries in our legacy dataset, with different MD5 checksums, were labeled as W32-Blaster-worm-a by
McAfee. These two binaries were labeled consistently by F-Prot (both as msblast), and Trend (both as
msblast), but inconsistently by Symantec (one blaster and one not detected) and ClamAV (one blaster, one
dcom.exploit). We then selected each system in turn and used their classification as the base. For example,
Table 3 shows that malware classified as the same by McAfee was classified as the same by F-Prot only 13% of the
time. However, malware classified as the same by F-Prot was classified as the same by McAfee only 50% of
the time. Not only do AV systems place malware into different categories, these categories do not hold the
same meaning across systems.
Dataset   AV Updated    Percentage of Malware Samples Detected
Name                    McAfee   F-Prot   ClamAV   Trend   Symantec
legacy    20 Nov 2006   100      99.8     94.8     93.73   97.4
small     20 Nov 2006   48.7     61.0     38.4     54.0    76.9
small     31 Mar 2007   67.4     68.0     55.5     86.8    52.4
large     31 Mar 2007   54.6     76.4     60.1     80.0    51.5
Table 4: The percentage of malware samples detected across datasets and AV vendors. AV does not provide
a complete categorization of the datasets.
AV         Number of Pages
           0     1-10   11-99   100+
McAfee     2     32     62      15
F-Prot     100   0      0       0
ClamAV     100   0      0       0
Trend      82    9      7       2
Symantec   2     7      71      20
Table 5: The percentage of malware labels that returned 0, 1-10, 11-99, or 100+ pages when searched on
the AV vendor’s web-site. AV labels provide too little information or too much.
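The pairwise consistency measure of Section 2.3.1 can be written down compactly. The sketch below is one plausible reading of the procedure, not the authors' code. It assumes each AV's output is a mapping from MD5 to label, with None standing for "not detected", and it assumes that a pair counts as inconsistent whenever the other AV misses either sample. The example reuses the Blaster pair described in the text; the sample identifiers are placeholders.

```python
from itertools import combinations

def consistency(base, other):
    """Percentage of sample pairs that `base` labels the same
    which `other` also labels the same.

    base, other: dicts mapping md5 -> label (None means not detected).
    Assumption: a pair with an undetected sample in `other` is a disagreement.
    """
    detected = [m for m, lab in base.items() if lab is not None]
    same_in_base = [
        (a, b) for a, b in combinations(detected, 2) if base[a] == base[b]
    ]
    if not same_in_base:
        return None  # the base AV produced no same-label pairs to compare
    agree = sum(
        1 for a, b in same_in_base
        if other.get(a) is not None and other.get(a) == other.get(b)
    )
    return 100.0 * agree / len(same_in_base)

# The two Blaster binaries from the text; "b1" and "b2" are placeholder IDs.
mcafee = {"b1": "W32-Blaster-worm-a", "b2": "W32-Blaster-worm-a"}
fprot = {"b1": "msblast", "b2": "msblast"}
symantec = {"b1": "blaster", "b2": None}
clamav = {"b1": "blaster", "b2": "dcom.exploit"}

vs_fprot = consistency(mcafee, fprot)        # both labeled msblast
vs_symantec = consistency(mcafee, symantec)  # one sample not detected
vs_clamav = consistency(mcafee, clamav)      # two different labels
```

Running this with each AV in turn as `base` and every other AV as `other` yields one row of a matrix shaped like Table 3. The measure is asymmetric by construction, which is why the McAfee/F-Prot and F-Prot/McAfee entries differ.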
2.3.2 Completeness
As discussed earlier, the design goal for completeness is to provide a label for each and every item to be
classified. For each of the datasets and AV systems, we examined the percentage of time the AV systems
detected a given piece of malware (and hence provided a label). A small percentage of malware samples
are still undetected a year after the collection of the legacy dataset (Table 4). The results for more recent
samples are even more striking, with almost half the samples undetected in small and one quarter in large.
The one quarter undetected for the large set is likely an overestimate of the ability of the AV, as many of
the binaries labeled at that point were many months old (e.g., compare the improvement over time in the
two labeling instances of small). Thus, AV systems do not provide a complete labeling system.
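The completeness figures in Table 4 are simple detection percentages. As a minimal sketch, assuming the same hypothetical mapping from MD5 to label (None for undetected samples), with invented sample data:

```python
def detection_rate(labels):
    """Percentage of samples for which the AV produced any label.

    labels: dict mapping md5 -> label, with None for undetected samples.
    """
    if not labels:
        return 0.0
    detected = sum(1 for lab in labels.values() if lab is not None)
    return 100.0 * detected / len(labels)

# Invented scan results for four invented samples at two signature dates.
scan_nov = {"s1": "W32/Example.a", "s2": None, "s3": None, "s4": "W32/Example.b"}
scan_mar = {"s1": "W32/Example.a", "s2": "W32/Example.c", "s3": None, "s4": "W32/Example.b"}

rate_nov = detection_rate(scan_nov)
rate_mar = detection_rate(scan_mar)
```

Comparing the two rates for the same sample set mirrors the two rows for small in Table 4, where detection changes between the November and March signature updates.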
2.3.3 Conciseness
Conciseness refers to the ability of the labeling system to provide a meaningful label to a given item. A label
which carries either too much or too little meaning has minimal value. Table 5 examines the ability of
AV systems to provide conciseness. Because most AV labels do not carry significant meaning in themselves,
many vendors provide additional details of a threat by reference. Using the unique labels provided by each
AV system for the small dataset, we searched the AV vendor’s website for additional information about
the threat encountered. In many cases, the vendors provided no information about the threat (0 pages), or
simply too much information (100s of pages). McAfee performed the best with 31% of its queries yielding
a manageable number of search results. The numbers for ClamAV reflect that the project does not provide
meaning to any of its labels. Trend Micro provides a separate “virus encyclopedia” which matches each
malware with a single page 95% of the time. Unfortunately, the vast majority of these matches simply parse
the name to extract a category, and provide only a description of a general idea such as “trojan” or “worm”,
and not any specifics of the threat. AV systems do not provide concise representations of malware.
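The bucketing used in Table 5 can be reproduced as follows. This is only a sketch: it assumes the number of vendor search results per label has already been collected by some other means, and the labels and counts in the example are invented.

```python
from collections import Counter

BUCKETS = ("0", "1-10", "11-99", "100+")

def bucket(pages):
    """Map a search-result count to the page ranges used in Table 5."""
    if pages == 0:
        return "0"
    if pages <= 10:
        return "1-10"
    if pages <= 99:
        return "11-99"
    return "100+"

def page_distribution(page_counts):
    """Percentage of labels falling in each bucket.

    page_counts: dict mapping AV label -> number of pages returned.
    """
    if not page_counts:
        return {b: 0.0 for b in BUCKETS}
    counts = Counter(bucket(n) for n in page_counts.values())
    total = len(page_counts)
    return {b: 100.0 * counts[b] / total for b in BUCKETS}

# Invented labels and result counts.
dist = page_distribution({"label-a": 0, "label-b": 3, "label-c": 50, "label-d": 250})
```

Under this reading, a concise labeling system would concentrate its labels in the 1-10 bucket.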
3 Behavior-based malware clustering
As we described in the previous section, any meaningful labeling system must achieve consistency, completeness,
and conciseness, and existing approaches, such as anti-virus, fail to perform well on these metrics.
To address these limitations, we propose an approach based on the actual execution of malware samples
and observation of their persistent state changes. These state changes taken together make a behavioral
fingerprint, which can then be clustered with other fingerprints to define classes and subclasses of malware
that exhibit similar state change behaviors. In this section, we discuss our definition and generation of these
behavioral fingerprints and the techniques for clustering them.
New processes
    directs.exe
Modified files
    /WINDOWS/avserve2.exe
    /WINDOWS/system32/directs.exe
    /WINDOWS/system32/directs.exeopen
Modified registry keys
    HKCU/Software/Microsoft/Windows/CurrentVersion/Ru1n/directs.exe
    HKLM/SOFTWARE/Microsoft/Windows/CurrentVersion/Run/avserve2.exe
Network access
    scans port 445
    connects to port 445
Table 6: An example of a behavioral profile for a malware sample labeled as W32-Bagle-q by McAfee.
Label   MD5                                P/F/R/N     McAfee                Trend
A       71b99714cddd66181e54194c44ba59df   8/13/27/0   Not detected          W32/Backdoor.QWO
B       be5f889d12fe608e48be11e883379b7a   8/13/27/0   Not detected          W32/Backdoor.QWO
C       df1cda05aab2d366e626eb25b9cba229   1/1/6/1     W32/Mytob.gen@MM      W32/IRCBot-based!Maximus
D       5bf169aba400f20cbe1b237741eff090   1/1/6/2     W32/Mytob.gen@MM      Not detected
E       eef804714ab4f89ac847357f3174aa1d   1/2/8/3     PWS-Banker.gen.i      W32/Bancos.IQK
F       80f64d342fddcc980ae81d7f8456641e   2/11/28/1   IRC/Flood.gen.b       W32/Backdoor.AHJJ
G       12586ef09abc1520c1ba3e998baec457   1/4/3/1     W32/Pate.b            W32/Parite.B
H       ff0f3c170ea69ed266b8690e13daf1a6   1/2/8/1     Not detected          W32/Bancos.IJG
I       36f6008760bd8dc057ddb1cf99c0b4d7   3/22/29/3   IRC/Generic Flooder   IRC/Zapchast.AK@bd
J       c13f3448119220d006e93608c5ba3e58   5/32/28/1   Generic BackDoor.f    W32/VB-Backdoor!Maximus
Table 7: Ten unique malware samples. For each sample, the number of process, file, registry, and network
behaviors observed (P/F/R/N) and the classifications given by various AV vendors are listed.
3.1 Defining and generating malware behaviors
Previous work in behavioral signatures has been based at the abstraction level of low-level system events
such as individual system calls. In our system, the intent is to capture what the malware actually does on
the system. Such information is more invariant and directly useful to assess the potential damage incurred.
Individual system calls may be at a level that is too low for abstracting semantically meaningful information:
a higher abstraction level is needed to effectively describe the behavior of malware. We define the behavior
of malware in terms of non-transient state changes that the malware causes on the system. State changes are
a higher level abstraction than individual system calls, and avoid many common obfuscation techniques that
foil static analysis as well as low-level signatures, such as encrypted binaries and non-deterministic event
ordering. In particular, we extract simple descriptions of state changes from the raw event logs obtained
from malware execution. Spawned process names, modified registry keys, modified file names, and network
connection attempts are extracted from the logs and the list of such state changes becomes a behavioral
profile of a sample of malware. An example of this can be seen in Table 6.
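To make the idea of a behavioral profile concrete, the sketch below reduces a raw event log to the four kinds of state change listed above. The event names (`proc_create`, `file_write`, `reg_set`, `net_connect`) and the `(kind, target)` tuple format are assumptions made for illustration; the paper does not specify its log format. The example entries are taken from Table 6.

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorProfile:
    """Non-transient state changes caused by one malware sample."""
    processes: set = field(default_factory=set)
    files: set = field(default_factory=set)
    registry: set = field(default_factory=set)
    network: set = field(default_factory=set)

    def fingerprint(self):
        """Order-independent text form of the profile."""
        lines = []
        for kind in ("processes", "files", "registry", "network"):
            lines.extend(f"{kind}:{item}" for item in sorted(getattr(self, kind)))
        return "\n".join(lines)

# Hypothetical event kinds; a real tracer would define its own vocabulary.
KIND_TO_FIELD = {
    "proc_create": "processes",
    "file_write": "files",
    "reg_set": "registry",
    "net_connect": "network",
}

def build_profile(events):
    """Collapse (kind, target) events into a BehaviorProfile.

    Repeated events and event order are discarded; unknown kinds are ignored.
    """
    profile = BehaviorProfile()
    for kind, target in events:
        field_name = KIND_TO_FIELD.get(kind)
        if field_name:
            getattr(profile, field_name).add(target)
    return profile

events = [
    ("proc_create", "directs.exe"),
    ("file_write", "/WINDOWS/avserve2.exe"),
    ("file_write", "/WINDOWS/system32/directs.exe"),
    ("file_write", "/WINDOWS/system32/directs.exe"),  # repeated write collapses
    ("reg_set", "HKLM/SOFTWARE/Microsoft/Windows/CurrentVersion/Run/avserve2.exe"),
    ("net_connect", "connects to port 445"),
    ("thread_switch", "ignored"),  # transient event kind, dropped
]
profile = build_profile(events)
reordered = build_profile(reversed(events))
```

Because the profile stores sets and the fingerprint is sorted, two runs that perform the same state changes in a different order produce the same fingerprint, which is the robustness to non-deterministic event ordering that the text argues for.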
Observing the malware behavior requires actually executing the binaries. We execute each binary individually
inside a virtual machine [31] with Windows XP installed. The virtual machine is partially firewalled
so that the external impact of any immediate attack behaviors (e.g., scanning, DDoS, and spam) is minimized
during the limited execution period. The system events are captured and exported to an external
server using the Backtracker system [14]. In addition to exporting system events, the Backtracker system
provides a means of building causal dependency graphs of these events. The benefit of this approach is that
we can validate that changes we observe are a direct result of the malware, and not of some normal system
operation.
     A    B    C    D    E    F    G    H    I    J
A 0.06 0.07 0.84 0.84 0.82 0.73 0.80 0.82 0.68 0.77
B 0.07 0.06 0.84 0.85 0.82 0.73 0.80 0.82 0.68 0.77
C 0.84 0.84 0.04 0.22 0.45 0.77 0.64 0.45 0.84 0.86
D 0.85 0.85 0.23 0.05 0.45 0.76 0.62 0.43 0.83 0.86
E 0.83 0.83 0.48 0.47 0.03 0.72 0.38 0.09 0.80 0.85
F 0.71 0.71 0.77 0.76 0.72 0.05 0.77 0.72 0.37 0.54
G 0.80 0.80 0.65 0.62 0.38 0.78 0.04 0.35 0.78 0.86
H 0.83 0.83 0.48 0.46 0.09 0.73 0.36 0.04 0.80 0.85
I 0.67 0.67 0.83 0.82 0.79 0.38 0.77 0.79 0.05 0.53
J 0.75 0.75 0.86 0.85 0.83 0.52 0.85 0.83 0.52 0.08
Table 8: A matrix of the NCD between each of the ten malware in our example.
3.2 Clustering of malware
While the choice of abstraction and generation of behaviors provides useful information to users, operators,
and security personnel, the sheer volume of malware makes manual analysis of each new malware intractable.
Our malware source observed 3,700 samples in a 6-month period, over 20 new pieces per day. Each generated
fingerprint, in turn, can exhibit many thousands of individual state changes (e.g., infecting every .exe on
a Windows host). For example, consider the tiny subset of malware in table 7. The 10 distinct pieces
of malware generate from 10 to 66 different behaviors with a variety of different labels including disjoint
families, variants, and undetected malware. While some items obviously belong together in spite of their
differences (e.g., C and D), even the composition of labels across AV systems cannot provide a complete
grouping of the malware. Obviously, for these new behavioral fingerprints to be effective, similar behaviors
need to be grouped and appropriate meanings assigned.
Our approach to generating meaningful labels is achieved through clustering of the behavioral fingerprints.
In the following subsections we introduce this approach and the various issues associated with effective
clustering including how to compare fingerprints, combine them based on their similarity, and determine
which are the most meaningful groups of behaviors.
3.2.1 Comparing individual malware behaviors
While examining individual behavioral profiles provides useful information on particular malware samples,
our goal is to classify malware and give them meaningful labels. Thus malware samples must be grouped.
One way to group the profiles is to create a distance metric that measures the difference between any two
profiles, and use the metric for clustering. Our initial naive approach to defining similarity was based on the
concept of edit distance [8]. In this approach each behavior is treated as an atomic unit and we measure
the number of inserts or deletes of these atomic behaviors required to transform one behavioral fingerprint
into another. The method is fairly intuitive and straightforward to implement (think of the Unix command
diff here); however, it suffers from two major drawbacks:
• Overemphasizing size When the number of behaviors is large, edit distance is effectively
equivalent to clustering based on the length of the feature set. This overemphasizes differences over
similarities.
• Behavioral polymorphism Many of the clusters we observed had few exact matches for behaviors.
This is because the state changes made by malware may contain simple behavioral polymorphism (e.g.,
random file names).
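Both drawbacks can be seen in a small sketch of an insert/delete distance over atomic behaviors. The `indel_distance` helper and the example profiles are illustrative assumptions, not the original implementation; the distance is computed from the longest common subsequence, much as diff does.

```python
from typing import Sequence


def indel_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of whole-behavior insertions and deletions turning a into b."""
    # Longest-common-subsequence table, kept one row at a time.
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    # Everything outside the common subsequence is deleted or inserted.
    return len(a) + len(b) - 2 * lcs


# Made-up profiles: the two differ only in a randomly named dropped file.
p1 = ["file: C:\\WINDOWS\\a8f3.exe", "network: tcp/25", "registry: Run"]
p2 = ["file: C:\\WINDOWS\\q71k.exe", "network: tcp/25", "registry: Run"]

# A profile that shares all of p1 but also infects 1000 executables.
big = [f"file: infected_{i}.exe" for i in range(1000)] + p1

polymorphic_cost = indel_distance(p1, p2)  # random name counted as a full mismatch
size_cost = indel_distance(p1, big)        # dominated by the size difference
```

Here the random file name costs one delete plus one insert even though the two behaviors are nearly identical, and the distance to `big` is driven almost entirely by its length rather than by the behaviors it shares with `p1`.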
To solve these shortcomings we turned to normalized compression distance (NCD). NCD provides an
approximation of information content, and it has been successfully applied in a number of areas [27,
33]. NCD is defined as:
$$\mathrm{NCD}(x, y) = \frac{C(x + y) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
where "x + y" is the concatenation of x and y, and C(x) is the zlib-compressed length of x. Intuitively, NCD
represents the overlap in information between two samples. As a result, behaviors that are similar, but not
identical, are viewed as close (e.g., two registry entries with different values, random file names in the same
locations). Normalization, of course, then addresses the issue of differing information content. Table 8 shows
the normalized compression distance matrix for the malware described in Table 7.
Figure 2: A tree consisting of the malware from table 7 has been clustered via a hierarchical clustering
algorithm whose distance function is normalized compression distance. (Image not reproduced; internal
nodes are labeled c1 to c9 and leaves A to J.)
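The NCD definition above can be sketched directly with Python's standard zlib module. The way a profile is serialized to bytes (`profile_bytes`) and the example behaviors are assumptions for illustration; the text does not specify how fingerprints are encoded before compression.

```python
import zlib
from typing import Iterable


def compressed_len(data: bytes) -> int:
    """C(x): length of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # "x + y" is concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def profile_bytes(behaviors: Iterable[str]) -> bytes:
    """Assumed serialization: one behavior per line, sorted."""
    return "\n".join(sorted(behaviors)).encode("utf-8")


# Made-up profiles: a and a_variant differ only in a random file name.
base = [f"file: C:\\WINDOWS\\system32\\drv{i}.sys" for i in range(30)]
a = profile_bytes(base + ["file: C:\\WINDOWS\\a8f3k2.exe"])
a_variant = profile_bytes(base + ["file: C:\\WINDOWS\\q71zx9.exe"])
unrelated = profile_bytes(
    [f"registry: HKLM\\Software\\Vendor{i * 7}\\Run" for i in range(30)]
)

close = ncd(a, a_variant)
far = ncd(a, unrelated)
```

In this sketch `close` comes out smaller than `far`, since the compressor can encode the second profile mostly as references to the first. The compressed length of a concatenation can depend slightly on the order of its parts, which is consistent with Table 8 not being exactly symmetric.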
3.2.2 Constructing relationships between malware
Once we know the information content shared between two sets of behavioral fingerprints, we can combine
various pieces of malware based on their similarity. In our approach we construct a tree structure based on
the well-known hierarchical clustering algorithm [12]. In particular, we use pairwise single-linkage clustering
which defines the distance between two clusters as the minimum distance between any two members of the
clusters. We output the hierarchical cluster results as a tree graph in graphviz’s dot format [16]. Figure 2
shows the generated tree for the malware in table 7.
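As a minimal sketch of this step, the following code runs a naive single-linkage clustering on the Table 8 values and emits the tree in dot format. Because Table 8 is not exactly symmetric, the sketch takes the smaller of the two entries for each pair; that choice, the function names, and the node labeling are assumptions, not details taken from the original system.

```python
from typing import FrozenSet, List, Tuple

Merge = Tuple[FrozenSet[int], FrozenSet[int], float]

LABELS = "ABCDEFGHIJ"
TABLE8 = [
    [0.06, 0.07, 0.84, 0.84, 0.82, 0.73, 0.80, 0.82, 0.68, 0.77],
    [0.07, 0.06, 0.84, 0.85, 0.82, 0.73, 0.80, 0.82, 0.68, 0.77],
    [0.84, 0.84, 0.04, 0.22, 0.45, 0.77, 0.64, 0.45, 0.84, 0.86],
    [0.85, 0.85, 0.23, 0.05, 0.45, 0.76, 0.62, 0.43, 0.83, 0.86],
    [0.83, 0.83, 0.48, 0.47, 0.03, 0.72, 0.38, 0.09, 0.80, 0.85],
    [0.71, 0.71, 0.77, 0.76, 0.72, 0.05, 0.77, 0.72, 0.37, 0.54],
    [0.80, 0.80, 0.65, 0.62, 0.38, 0.78, 0.04, 0.35, 0.78, 0.86],
    [0.83, 0.83, 0.48, 0.46, 0.09, 0.73, 0.36, 0.04, 0.80, 0.85],
    [0.67, 0.67, 0.83, 0.82, 0.79, 0.38, 0.77, 0.79, 0.05, 0.53],
    [0.75, 0.75, 0.86, 0.85, 0.83, 0.52, 0.85, 0.83, 0.52, 0.08],
]


def symmetrize(matrix: List[List[float]]) -> List[List[float]]:
    """Assumption: use the smaller of the two entries for each pair."""
    n = len(matrix)
    return [[min(matrix[i][j], matrix[j][i]) for j in range(n)] for i in range(n)]


def single_linkage(dist: List[List[float]]) -> List[Merge]:
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


def to_dot(merges: List[Merge], labels: str) -> str:
    """Render the merge sequence as a graphviz dot digraph."""
    names = {frozenset([i]): labels[i] for i in range(len(labels))}
    lines = ["digraph clusters {"]
    for k, (left, right, height) in enumerate(merges, start=1):
        node = f"c{k}"
        names[left | right] = node
        lines.append(f'  {node} [label="{node} ({height:.2f})"];')
        lines.append(f"  {node} -> {names[left]};")
        lines.append(f"  {node} -> {names[right]};")
    lines.append("}")
    return "\n".join(lines)


merges = single_linkage(symmetrize(TABLE8))
heights = [h for _, _, h in merges]
dot = to_dot(merges, LABELS)
```

On these values the first merges join A with B, then E with H, then C with D. The naive pairwise search is cubic in the number of samples and is meant only to make the linkage rule explicit.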
3.2.3 Extracting meaningful groups
While the tree-based output of the hierarchical clustering algorithm does show the relationships between
the information content of behavioral fingerprints, it does not focus attention on areas of the tree in which
the similarities (or lack thereof) indicate an important group of malware. Therefore, we need a mechanism
to extract meaningful groups from the tree. A naive approach to this problem would be to set a single
threshold of the differences between two nodes in the tree. However, this can be problematic as a single
uniform distance does not accurately represent the distance between various subtrees. For example, consider
the dendrogram in figure 3. The height of each U-shaped line connecting objects in a hierarchical tree
illustrates the distance between the two objects being connected. As the figure shows, the difference between
the information content of subtrees can be substantial. Therefore, we require an automated means of
discovering where the most important changes occur.
To address this limitation, we adopt an “inconsistency” measure that is used to compute the difference
in magnitude between distances of clusters so that the tree can be cut into distinct clusters. Clusters
are constructed from the tree by first calculating the inconsistency coefficient of each cluster, and then
thresholding based on the coefficient. The inconsistency coefficient characterizes each link in a cluster tree
by comparing its length with the average length of other links at the same level of the hierarchy. The
higher the value of this coefficient, the less similar are the objects connected by the link. The inconsistency
coefficient calculation has one parameter, which is the depth below the level of the current link to consider
in the calculation. All the links at the current level in the hierarchy, as well as links down to the given depth
below the current level, are used in the inconsistency calculation.
Figure 3: A dendrogram illustrating the distance between various subtrees. (Image not reproduced; leaf
order A B F I J C D E H G, vertical axis ticks from 0.1 to 0.7.)
Cluster  Elements  Overlap  Example
c1       C, D      67.86%   scans 25
c2       A, B      97.96%   installs a cygwin rootkit
c3       E, G, H   56.60%   disables AV
c4       F, I, J   53.59%   IRC
Table 9: The clusters generated via our technique for the malware listed in table 7.
In table 9 we see the result of the application of this approach to the example malware in table 7. The
ten unique pieces of malware generate four unique clusters. Each cluster shows the elements in that cluster,
the average number of unique behaviors in common between the clusters, and an example of a high-level
behavior in common between each binary in the cluster. For example, cluster one consists of C and D and
represents two unique behaviors of mytob, a mass mailing scanning worm. Five of the behaviors observed
for C and D are identical (e.g., scans port 25), but several others exhibit some behavioral polymorphism
(e.g., different run on reboot registry entries). The other three clusters exhibit similar expected results, with
cluster two representing the cygwin backdoors, cluster three the bancos variants, and cluster four a class of
IRC backdoors.
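The inconsistency-based cut described in this subsection can be sketched as follows. The formula used here, (link height minus mean height) divided by the sample standard deviation of the heights within the given depth, follows the definition commonly used by numerical packages and is an assumption; the text gives no formula. The hard-coded link heights are what single-linkage yields on the Table 8 values when each pair's two entries are replaced by their minimum, and the depth of 2 and threshold of 0.75 are illustrative choices, not the settings used in the evaluation.

```python
import statistics
from typing import List, Tuple, Union

Node = Union[str, int]  # a leaf label, or an index into LINKS

# (left child, right child, merge height); see the assumptions above.
LINKS: List[Tuple[Node, Node, float]] = [
    ("A", "B", 0.07),  # 0
    ("E", "H", 0.09),  # 1
    ("C", "D", 0.22),  # 2
    (1, "G", 0.35),    # 3
    ("F", "I", 0.37),  # 4
    (2, 3, 0.43),      # 5
    (4, "J", 0.52),    # 6
    (0, 6, 0.67),      # 7
    (7, 5, 0.72),      # 8 (root)
]
ROOT = len(LINKS) - 1


def heights_within(link: int, depth: int) -> List[float]:
    """Heights of this link and of the links up to depth - 1 levels below it."""
    left, right, height = LINKS[link]
    found = [height]
    if depth > 1:
        for child in (left, right):
            if isinstance(child, int):
                found.extend(heights_within(child, depth - 1))
    return found


def inconsistency(link: int, depth: int = 2) -> float:
    hs = heights_within(link, depth)
    if len(hs) < 2:
        return 0.0  # nothing to compare against
    spread = statistics.stdev(hs)
    if spread == 0:
        return 0.0
    return (LINKS[link][2] - statistics.mean(hs)) / spread


def leaves(node: Node) -> List[str]:
    if isinstance(node, str):
        return [node]
    left, right, _ = LINKS[node]
    return leaves(left) + leaves(right)


def max_inconsistency(node: Node, depth: int) -> float:
    if isinstance(node, str):
        return 0.0
    left, right, _ = LINKS[node]
    return max(inconsistency(node, depth),
               max_inconsistency(left, depth),
               max_inconsistency(right, depth))


def cut(node: Node, threshold: float, depth: int = 2) -> List[List[str]]:
    """Keep a subtree whole only if every link in it is consistent enough."""
    if isinstance(node, str) or max_inconsistency(node, depth) <= threshold:
        return [sorted(leaves(node))]
    left, right, _ = LINKS[node]
    return cut(left, threshold, depth) + cut(right, threshold, depth)


clusters = sorted(cut(ROOT, threshold=0.75))
```

With these illustrative settings the cut yields the groups {A, B}, {C, D}, {E, G, H}, and {F, I, J}, which are the same four groups listed in table 9.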
4 Evaluation
To demonstrate the effectiveness of behavioral clustering, we evaluate our technique on the large dataset
discussed in section 2. We demonstrate the runtime performance and the effect of various parameters on the
system, show the completeness, conciseness, and consistency of the generated clusters, and illustrate the
utility of the clusters by answering relevant questions about the malware samples.
Figure 4: The memory required for performing clustering based on the number of malware clustered (for a
variety of different sized malware behavior). (Plot not reproduced; x-axis: Number of Malware to Cluster,
0 to 600; y-axis: Bytes, 0 to 3e+08; series: Inconsistency-based Tree Cutting, Normalized Compression
Distance, Single-Linkage Hierarchical Clustering.)
4.1 Performance and parametrization
In this section we examine the memory usage and the execution time for the hierarchical clustering algorithm
we have chosen. To obtain these statistics, we take random sub-samples of between 1 and 526 samples
from the small dataset. For each sub-sample, we analyze its run time and memory consumption by running
ten trials for each. The experiments were performed on a Dell PowerEdge 4600 with two Intel Xeon MP
CPUs (3.00GHz), 4 GB of DDR ECC RAM, 146G Cheetah Seagate drive with an Adaptec 3960D Ultra160
SCSI adapter, running Fedora Core Linux.
We first decompose the entire execution process into these logical steps: (1) trace collection, (2) state
change extraction, (3) NCD distance matrix computation, an O(N^2) operation, (4) clustering the distance
matrix into a tree, and (5) cutting the tree into clusters. We focus on the latter three operations specific to our
algorithm for performance evaluation. Figure 4 shows the memory usage for those three steps. As expected,
computing NCD requires the most memory, and its usage grows fastest as the number of malware samples
increases, consistent with the quadratic number of pairwise distances. However, clustering 500 malware
samples requires less than 300MB of memory. The memory
usage for the other two components grows at a much slower rate. Examining the run-time in Figure 5
indicates that all three components can complete within hundreds of seconds for clustering several hundred
malware samples.
The tree cutting algorithm has two parameters: the inconsistency measure and the depth value. Figure 6
illustrates their effects on the number of clusters produced for the small dataset for various look-ahead
depths and inconsistency metrics. Values between 4 and 6 appear at the knee of many of the curves. To
evaluate the effect of inconsistency, we fixed our look-ahead depth to 4 and evaluated the number of clusters
versus the average size of the clusters for various inconsistency values in the large dataset. Intuitively, larger
inconsistency measures lead to fewer clusters and larger depth values for computing inconsistency result in
more clusters. The results of this analysis, shown in figure 7, show a smooth trade-off until an inconsistency
value of 2.3, where the clusters quickly collapse into a single cluster.
Figure 5: The runtime required for performing clustering based on the number of malware clustered (for a
variety of different sized malware behavior). (Plot not reproduced; x-axis: Number of Malware to Cluster,
0 to 600; y-axis: Seconds, log scale from 0.0001 to 100; series: Inconsistency-based Tree Cutting,
Normalized Compression Distance, Single-Linkage Hierarchical Clustering.)
          Conciseness                      Completeness                          Consistency
AV        Families  Variants  Singletons   Detected  Not Detected  % Detected    Identical Behavior Labeled Identically
McAfee    309       309       166          2018      1680          54.6%         47.2%
F-Prot    194       1544      1464         2958      740           80.0%         31.1%
ClamAV    119       1102      889          2244      1454          60.7%         34.9%
Trend     137       2034      1908         2960      738           80.0%         44.2%
Symantec  107       125       65           1904      1794          51.5%         68.2%
Behavior  403       403       206          3387      311           91.6%         100%
Table 10: The Conciseness, Completeness, and Consistency of the clusters created with our algorithm on
the large dataset as compared to various AV vendors.
4.2 Measuring the Conciseness, Completeness, and Consistency of the clusters
Table 10 summarizes the clusters created using behavioral clustering on the large dataset from section 2.
Our algorithm created 403 clusters from the 3,698 individual pieces of malware. While it is infeasible to list
all the clusters here, a list of the clusters, the malware and behaviors in each cluster, and their AV labels are
available at http://www.eecs.umich.edu/ mibailey/malware/. In addition to a hand analysis of the clusters,
we evaluated the clusters for our stated goals. Table 10 shows the results of this analysis.
4.2.1 Completeness
In order to measure completeness we examined the number of times we created a meaningful label for a
binary and compared this to the detection rates of the various AV products. For AV software, “not detected”
means no signature matched, despite the up-to-date signature information. For behavioral clustering, “not
detected” means that we identified no behavior. Roughly 311 binaries exhibited no behavior that we were
able to measure. The root causes of these failures include unsuccessful unpacking or crashing during execution,
corrupted binaries, and pop-up windows that required user interaction. In at least one case, we also observed
behaviors that were visible on a non-virtualized host but were not visible in the VMware environment. A
striking observation from the table is that many AV software systems provide detection rates as low as 33%,
compared to around 91% using behavioral clustering.
Figure 6: The number of clusters generated for various values of the inconsistency parameter and depth.
(Plot not reproduced; x-axis: Inconsistency Threshold, 0 to 4; y-axis: Number of Clusters, 0 to 325; one
curve per Depth value: 1, 2, 4, 6, 8, 10, 12, 14.)
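The completeness percentages in Table 10 follow directly from the detected and not-detected counts. A short worked check, using only the counts from that table (the variable and function names are illustrative):

```python
# (detected, not detected) counts copied from Table 10.
TABLE10 = {
    "McAfee": (2018, 1680),
    "F-Prot": (2958, 740),
    "ClamAV": (2244, 1454),
    "Trend": (2960, 738),
    "Symantec": (1904, 1794),
    "Behavior": (3387, 311),
}


def detection_rate(detected: int, not_detected: int) -> float:
    """Percentage of samples that received a label."""
    return 100.0 * detected / (detected + not_detected)


rates = {name: round(detection_rate(*row), 1) for name, row in TABLE10.items()}
totals = {name: sum(row) for name, row in TABLE10.items()}
```

Every row sums to the same 3,698 samples, and the rounded rates reproduce the "% Detected" column, for example 3387 / 3698 for behavioral clustering.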
4.2.2 Conciseness
Conciseness represents the ability of the label to embed or refer to a specific detailed piece of information
that describes the threat. By design our approach always provides such a description for any detected
binary. However, simply providing the label is insufficient if we succeed in only creating a detailed list of
behaviors per binary, as the number of such descriptions will become too great to manage. To investigate
the number and type of descriptions created, we compared the number of families, variants and singletons
in our approach with those of the AV vendors. A variant is defined based on the label provided by the
AV software (e.g., W32-Sdbot.AC, Sdbot.42). Family is a generalized label heuristically extracted from the
variant label based on the portion that is intended to be human-readable (e.g., the labels above would be
in the “sdbot” family). Singletons are variants or families which contain only one sample. Typically an
excessive number of singleton families or variants indicates that the classification algorithm is not effective
at identifying commonalities across malware instances. We note that, although we have more clusters than
families for the AV systems, the number of clusters compared to the number of variants is quite small. In
addition, the number of singletons is greatly reduced as well.
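The family, variant, and singleton counts can be illustrated with a small sketch. The heuristic in `family_of` (take the longest purely alphabetic token of the label) is a hypothetical stand-in, since the actual extraction heuristic is not specified; the Mytob label in the example is also made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List


def family_of(variant_label: str) -> str:
    """Hypothetical heuristic: the longest purely alphabetic token."""
    tokens = re.split(r"[^A-Za-z0-9]+", variant_label)
    words = [t.lower() for t in tokens if t.isalpha()]
    return max(words, key=len) if words else variant_label.lower()


def conciseness(labels: List[str]) -> Dict[str, int]:
    """Count distinct variants and families, and those seen only once."""
    variants = Counter(labels)
    families = Counter(family_of(label) for label in labels)
    return {
        "families": len(families),
        "variants": len(variants),
        "singleton_variants": sum(1 for n in variants.values() if n == 1),
        "singleton_families": sum(1 for n in families.values() if n == 1),
    }


# One label per sample; two samples carry the same variant label.
labels = ["W32-Sdbot.AC", "Sdbot.42", "W32/Mytob.B", "W32/Mytob.B"]
summary = conciseness(labels)
```

In this toy input the two Sdbot labels are distinct singleton variants that collapse into one family, which is the pattern that inflates variant counts relative to family counts.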
4.2.3 Consistency
Consistency refers to the ability of our clusters to label or cluster in a way such that the meaning of the
label is not obscured. In our system we measure sameness as performing the same or similar behaviors. In
order to measure the consistency of the system, we examined the binaries that exhibited exactly identical
behavior. In the large sample roughly 2,200 binaries exhibited identical behaviors. These binaries created
267 groups of identical behavior. We compared the percentage of time the clusters were identified as the same
through our approach as well as the various AV systems. As expected, our system placed all the identical
behaviors in the same clusters, while the AV systems failed to do so.
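One plausible reading of this measurement is sketched below: group samples by exactly identical behavior, then report the percentage of groups whose members all carry the same label. The function name, the data shapes, and the treatment of an undetected sample (label `None`) as its own label are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def consistency(profiles: Dict[str, List[str]],
                labels: Dict[str, Optional[str]]) -> float:
    """Percent of identical-behavior groups whose members share one label."""
    groups = defaultdict(list)
    for sample, profile in profiles.items():
        groups[frozenset(profile)].append(sample)
    shared = [members for members in groups.values() if len(members) > 1]
    if not shared:
        return 100.0  # no identical-behavior groups to disagree on
    same = sum(1 for members in shared
               if len({labels.get(s) for s in members}) == 1)
    return 100.0 * same / len(shared)


# Made-up data: s1/s2 behave identically, as do s3/s4; s5 is unique.
profiles = {
    "s1": ["network: tcp/25", "process: cmd.exe"],
    "s2": ["process: cmd.exe", "network: tcp/25"],
    "s3": ["network: tcp/6667"],
    "s4": ["network: tcp/6667"],
    "s5": ["file: change.log"],
}
av_labels = {"s1": "sdbot", "s2": "sdbot", "s3": "mytob", "s4": None, "s5": "x"}
cluster_labels = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2", "s5": "c3"}

av_score = consistency(profiles, av_labels)
cluster_score = consistency(profiles, cluster_labels)
```

A labeler that misses one member of an identical-behavior group, as the hypothetical AV labels do for s4, loses that group.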
Figure 7: The trade-off between the number of clusters, the average cluster size, and the inconsistency value.
(Plot not reproduced; x-axis: Inconsistency, 0 to 3; y-axis: log scale from 0.1 to 10000; series: Average
Cluster Size, Number of Clusters.)
4.3 Application of clustering and behavior signatures
In this subsection we look at several applications of this technique in the context of the clusters created by
our algorithm from the large dataset.
4.3.1 Classifying emerging threats
Behavioral classification can be effective in characterizing emerging threats not yet known or not detected
by AV signatures. For example, cluster c156 consists of three malware samples which exhibit malicious bot-
related behavior including IRC command and control activities. Each of the 75 behaviors observed in the
cluster is shared with other samples of the group 96.92% on average, meaning the malware samples within
the cluster have almost identical behavior. However, none of the AV vendors detect the samples in this
cluster except for F-Prot, which only detects one of the samples. It is clear that our behavioral classification
would assist in identifying these samples as emerging threats through their extensive malicious behavioral
profile.
4.3.2 Resisting binary polymorphism
Along similar lines to the previous example, behavioral classification can also assist in grouping an undetected
outlier sample, due to polymorphism or some other deficiency in the AV signatures, together with a common
family that it shares significant behaviors with. For example, cluster c80 consists of three samples that share
identical behaviors with distinctive strings "bling.exe" and "m0rgan.org". The samples in this cluster are
consistently labeled as a malicious bot across the AV vendors except Symantec, which fails to identify one of
the samples. In order to maintain completeness, this outlier sample should be labeled similarly to the other
samples based on its behavioral profile despite the AV detection failure.
4.3.3 Examining the malware behaviors
Clearly one of the values of any type of automated security system is not to simply provide detailed
information on individual malware and their behaviors, but also to provide broad analysis on future directions of
malware. Using the behavioral signatures created by our system, we extracted the most prevalent behaviors
for each of the various categories of behaviors we monitor. The top five such behaviors in each category are
shown in table 11.
Network           Process        Files           Registry
connects to 80    cmd.exe        winhlp32.dat    use wininet.dll
connects to 25    IEXPLORE.EXE   tasklist32.exe  use PRNG
connects to 6667  regedit.exe    change.log      modify registered applications
connects to 587   tasklist32.exe mirc.ini        modify proxy settings
scans port 80     svchost.exe    svchost.exe     modify mounted drives
Table 11: The top five behaviors observed by type.
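Extracting the most prevalent behaviors per category amounts to a frequency count over all behavioral profiles. The sketch below assumes each behavior is stored as a "category: detail" string, which is an illustrative encoding rather than the format used by the original system.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def top_behaviors(profiles: List[List[str]], n: int = 5) -> Dict[str, List[str]]:
    """Most common behaviors per category, counted once per sample."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for profile in profiles:
        for behavior in set(profile):  # ignore repeats within one sample
            category, _, detail = behavior.partition(": ")
            counts[category][detail] += 1
    return {category: [detail for detail, _ in counter.most_common(n)]
            for category, counter in counts.items()}


# Made-up profiles for illustration.
profiles = [
    ["network: connects to 25", "process: cmd.exe"],
    ["network: connects to 25", "network: connects to 6667", "process: cmd.exe"],
    ["network: connects to 25", "network: connects to 6667", "process: svchost.exe"],
]
top = top_behaviors(profiles, n=2)
```

Counting each behavior once per sample keeps a single binary that repeats an action many times from dominating the ranking.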
The network behavior seems to conform with agreed notions of the tasks being performed by most
malware today. Two of the top five network behaviors involve the use of mail ports, presumably for spam.
Port 6667 is a common IRC port and is often used for remote control of the malware. Two of the ports are
HTTP ports used by systems to check for jailed environments, download code via the web, or tunnel command
and control over what is often an unfiltered port. The process behaviors are interesting in that many
process executables are named like common Windows utilities to avoid arousing suspicion (e.g., svchost.exe,
tasklist32.exe). In addition, some malware uses IEXPLORE.EXE directly to launch popup ads and redirect
users to potential phishing sites. This use of existing programs and libraries will make simple anomaly
detection techniques more difficult. The file writes show common executable names and data files written
to the filesystem by malware. For example, the winhlp32.dat file is a data file common to many bancos
trojans. Registry keys are also fairly interesting indications of behavior and the prevalence of wininet.dll
keys shows heavy use of existing libraries for network support. The writing to PRNG keys indicates a heavy
use of randomization as the seed is updated every time a PRNG-related function is used. As expected, the
malware does examine and modify the registered applications on a machine, the TCP/IP proxy settings (in
part to avoid AV), and queries mounted drives.
5 Related Work
Our work is the first to apply automated clustering to understand malware behavior using resulting state
changes on the host to identify various malware families. Related work in malware collection, analysis, and
signature generation has primarily explored static and byte-level signatures [25, 19] focusing on invariant
content. Content-based signatures are insufficient to cope with emerging threats due to intentional evasion.
Behavioral analysis has been proposed as a solution to deal with polymorphism and metamorphism, where
malware changes its visible instruction sequence (typically the decryptor routine) as it spreads. For example,
Jordan [13] argues that metamorphism can be overcome through emulating malware executables and then
coalescing higher-level actions. These higher-level actions, or behaviors, are what we attempt to use to
overcome metamorphism in this work. Similar to our work, emulating malware to discover spyware behavior
by using anti-spyware tools has been used in measurement studies [23].
There are several abstraction layers at which behavioral profiles can be created. Previous work has
focused on lower layers, such as individual system calls [17, 11, 1], instruction-based code templates [6], the
initial code run on malware infection (shellcode) [20], and network connection and session behavior [34].
Such behavior needs to be effectively elicited. For example, recent work by Royal et al. [29] automates
hidden-code extraction of unpack-executing malware. In our work, we chose a higher abstraction layer for
several reasons. In considering the actions of malware, it is not the individual system calls that define the
significant actions that a piece of malware inflicts upon the infected host, rather, it is the resulting changes
in state of the host. Also, although lower levels may allow signatures that differentiate malware, they do
not provide semantic value in explaining what behaviors are exhibited by a malware variant or family. In
our work, we define malware by what it actually does, and thereby build in more semantic meanings to the
profiles and clusters generated. This influenced our choice of a high abstraction layer at which to create
behavioral profiles.
Various aspects of high-level behavior could be included in the definition of a behavioral profile. Network
behavior may be indicative of malware and has been used to detect malware infections. For example, Ellis
et al. [10] extracted network-level features such as similar data being sent from one machine to the next, a
tree-like communication pattern, and a server becoming a client. Singh et al. [30] automatically generated
network-level signatures for malware by finding common byte sequences sent from many sources to many
destinations, which is characteristic of worm propagation. In our work, we focus on individual host behavior,
including network connection information but not the data transmitted over the network. Thus we focus
more on the malware behavior on individual host systems instead of the pattern across a network, as behavior
on individual hosts is the basic building block for understanding the overall network behavior.
Recently, Kolter and Maloof [15] studied applying machine learning to classify malicious executables
using n-grams of byte codes. Our use of hierarchical clustering based on normalized compression distance is
a first step at examining how statistical techniques are useful in classifying malware, but the features used
are the resulting state changes on the host to be more resistant to evasion and inaccuracies. Normalized
information distance was proposed by Li et al. [18] as an optimal similarity metric to approximate all other
effective similarity metrics. It was used to cluster various kinds of data to discover families or groups, and
has been applied to domains such as gene expression [27], languages, literature, music, handwritten digits,
and astronomy [7]. In previous work [33] NCD was applied to worm executables directly and the network
traffic generated by worms. Our work applies NCD at a different layer of abstraction. Rather than applying
NCD to the literal malware executables, we apply NCD to the malware behavior. Previous work can capture
similarities in the initial decryption routines of packed or encrypted executables, but may not differentiate
behavioral features that are obscured in the encrypted payload. By emulating the executable, we capture
behavioral features that are not otherwise available.
6 Limitations and Future Work
Our system is not without limitations and shares common weaknesses associated with malware execution
within virtual machine environments. Since the malware samples were executed within VMware, samples that
employ anti-VM evasion techniques may not exhibit their malicious behavior. To mitigate this limitation,
the samples could be run on a real, non-virtualized system, which would be restored to a clean state after
each simulation.
Another limitation is the time period in which behaviors are collected from the malware execution. In
our experiments, each binary was able to run for five minutes before the virtual machine was terminated.
It is possible that certain behaviors were not observed within this period due to time-dependent or delayed
activities. Previous research has been done to detect such time-dependent triggers [9]. A similar limitation
is malware that depends on user input, such as responding to a popup message box, before exhibiting further
malicious behavior as mentioned in [23].
The capabilities and environment of our virtualized system stayed static throughout our experiments
for consistency. However, varying the execution environment by using multiple operating system versions,
including other memory resident programs such as anti-virus protection engines, and varying network con-
nectivity and reachability may yield interesting behaviors not observed in our existing results.
Our choice of a high level of abstraction may limit fine-grained visibility into each of the observed behav-
iors in our system. A path for future work could include low-level details of each state change to supplement
the high-level behavior description. For example, the actual contents of disk writes and transmitted network
packets could be included in a sample’s behavioral profile.
We plan to evaluate the integration of other high-level behavioral reports from existing systems such
as Norman [26] and CWSandbox [5] in the future. We will also investigate further clustering and machine
learning techniques that may better suit these other types of behavioral profiles.
7 Conclusion
In this paper we demonstrated that existing host-based techniques (e.g., anti-virus) fail to provide useful
labels to the malware they encounter. We showed that anti-virus is incomplete in that it fails to detect or
provide labels for between 20 and 62 percent of the malware samples. We noted that when these systems
do provide labels, these labels do not have consistent meanings across families and variants within a single
naming convention as well as across multiple vendors and conventions. Finally, we demonstrated that these
systems lack conciseness in that they provide too little or, in some cases, too much information about a specific
piece of malware.
To address these important limitations we proposed a novel approach to the problem of automated malware
classification and analysis. Our dynamic approach executes the malware in a virtualized environment
and creates a behavioral fingerprint of the malware's activity. This fingerprint is the set of all the state
changes that are a causal result of the infection, including files modified, processes created, and network
connections. In order to compare these fingerprints and combine them into meaningful groups of behaviors, we
applied single-linkage hierarchical clustering of the fingerprints using normalized compression distance as a
distance metric. We demonstrated the usefulness of this technique by applying it to the automated classification
and analysis of 3,700 malware samples collected over the last six months.
References
[1] Debin Gao, Michael K. Reiter, and Dawn Xiaodong Song. Behavioral
distance measurement using hidden Markov models. In RAID, pages 19–40, 2006.
[2] Paul Baecher, Markus Koetter, Thorsten Holz, Maximillian Dornseif, and Felix Freiling. The nepenthes
platform: An efficient approach to collect malware. In 9th International Symposium On Recent Advances
In Intrusion Detection. Springer-Verlag, 2006.
[3] Paul Barford and Vinod Yagneswaran. An inside look at botnets. In To appear in Series: Advances in
Information Security. Springer, 2006.
[4] Desiree Beck and Julie Connolly. The Common Malware Enumeration Initiative. In Virus Bulletin
Conference, October 2006.
[5] Carsten Willems and Thorsten Holz. Cwsandbox. http://www.cwsandbox.org/, 2007.
[6] Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Randal E. Bryant. Semantics-
aware malware detection. In Proceedings of the 2005 IEEE Symposium on Security and Privacy (Oakland
2005), pages 32–46, Oakland, CA, USA, May 2005. ACM Press.
[7] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. In Information Theory, IEEE
Transactions on, volume 51, pages 1523–1545, 2005.
[8] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algo-
rithms. The MIT Press, Cambridge, MA, 1990.
[9] Jedidiah R. Crandall, Gary Wassermann, Daniela A. S. de Oliveira, Zhendong Su, S. Felix Wu, and
Frederic T. Chong. Temporal Search: Detecting Hidden Malware Timebombs with Virtual Machines. In
Proceedings of the 12th International Conference on Architectural Support for Programming Languages
and Operating Systems, San Jose, CA, October 2006. ACM Press New York, NY, USA.
[10] Dan Ellis, John Aiken, Kira Attwood, and Scott Tenaglia. A Behavioral Approach to Worm Detection.
In Proceedings of the ACM Workshop on Rapid Malcode (WORM04), October 2004.
[11] Debin Gao, Michael K. Reiter, and Dawn Xiaodong Song. Behavioral distance for intrusion detection. In
Alfonso Valdes and Diego Zamboni, editors, RAID, volume 3858 of Lecture Notes in Computer Science,
pages 63–81. Springer, 2005.
[12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, 2001.
[13] Myles Jordan. Anti-virus research - dealing with metamorphism. Virus Bulletin Magazine, October
2002.
[14] Samuel T. King and Peter M. Chen. Backtracking intrusions. In Proceedings of the 19th ACM Symposium
on Operating Systems Principles (SOSP’03), pages 223–236, Bolton Landing, NY, USA, October 2003.
ACM.
[15] J. Zico Kolter and Marcus A. Maloof. Learning to Detect and Classify Malicious Executables in the
Wild. Journal of Machine Learning Research, 2007.
[16] Eleftherios Koutsofios and Stephen C. North. Drawing graphs with dot. Technical report, AT&T Bell
Laboratories, Murray Hill, NJ, 8 October 1993.
[17] Tony Lee and Jigar J. Mody. Behavioral classification. In Proceedings of EICAR 2006, April 2006.
[18] Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul Vitányi. The similarity metric. In SODA ’03: Proceedings
of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 863–872, Philadelphia,
PA, USA, 2003. Society for Industrial and Applied Mathematics.
[19] Z. Li, M. Sanghi, Y. Chen, M. Kao, and B. Chavez. Hamsa: Fast Signature Generation for Zero-day
Polymorphic Worms with Provable Attack Resilience. In Proc. of IEEE Symposium on Security and
Privacy, 2006.
[20] Justin Ma, John Dunagan, Helen Wang, Stefan Savage, and Geoffrey Voelker. Finding Diversity in
Remote Code Injection Exploits. Proceedings of the USENIX/ACM Internet Measurement Conference,
October 2006.
[21] McAfee. W32/Sdbot.worm. http://vil.nai.com/vil/content/v_100454.htm, April 2003.
[22] Microsoft. Microsoft security intelligence report: January-June 2006.
http://www.microsoft.com/technet/security/default.mspx, October 2006.
[23] Alex Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy. A Crawler-based Study of
Spyware in the Web. In Proceedings of the Network and Distributed System Security Symposium (NDSS),
San Diego, CA, 2006.
[24] Arbor Networks. Arbor malware library (AML). http://www.arbornetworks.com/, 2006.
[25] James Newsome, Brad Karp, and Dawn Song. Polygraph: Automatically generating signatures for
polymorphic worms. In Proceedings of the 2005 IEEE Symposium on Security and Privacy, Oakland, CA, USA,
May 8–11, 2005.
[26] Norman Solutions. Norman sandbox whitepaper. http://www.norman.no/, 2003.
[27] Matti Nykter, Olli Yli-Harja, and Ilya Shmulevich. Normalized compression distance for gene expression
analysis. In Workshop on Genomic Signal Processing and Statistics (GENSIPS), May 2005.
[28] Matthew B. Prince, Benjamin M. Dahl, Lee Holloway, Arthur M. Keller, and Eric Langheinrich.
Understanding how spammers steal your e-mail address: An analysis of the first six months of data from
project honey pot. In Second Conference on Email and Anti-Spam (CEAS 2005), July 2005.
[29] Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee. PolyUnpack: Automating
the Hidden-Code Extraction of Unpack-Executing Malware. In The 22nd Annual Computer Security
Applications Conference (ACSAC 2006), Miami Beach, FL, December 2006.
[30] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Automated worm fingerprinting.
In 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 45–60, San
Francisco, CA, December 6–8 2004.
[31] Brian Walters. VMware virtual platform. Linux Journal, 63, July 1999.
[32] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Samuel T.
King. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser
vulnerabilities. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2006,
San Diego, California, USA, 2006.
[33] Stephanie Wehner. Analyzing worms and network traffic using compression. Technical report, CWI,
Amsterdam, 2005.
[34] Vinod Yegneswaran, Jonathon T. Giffin, Paul Barford, and Somesh Jha. An Architecture for Generating
Semantics-Aware Signatures. In Proceedings of the 14th USENIX Security Symposium, pages 97–112,
Baltimore, MD, USA, August 2005.