This document is a thesis submitted by Iason Papapanagiotakis-Bousy to University College London for the degree of Master of Science in Information Security. The thesis defines external metamorphic obfuscation engines using term rewriting systems and analyzes the problem of learning the rewriting rules of such obfuscations given a finite set of malware samples. Specifically, it proves the impossibility of exactly learning the rules but provides an algorithm for approximating the rules under certain assumptions. The work aims to lay the foundations for further research on analyzing metamorphic malware obfuscations.
Proceedings of the 50th Hawaii International Conference on System Sciences | 2017
Discovering Malware with Time Series Shapelets
Om P. Patri
University of Southern California
Los Angeles, CA 90089
patri@usc.edu
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR (ijcax)
A pattern classification system maps patterns into a feature space and separates them with a decision boundary. Pattern classification systems are used in adversarial applications such as spam filtering, network intrusion detection systems (NIDS), and biometric authentication. Spam filtering is an adversarial application in which data can be manipulated by humans to degrade the classifier's operation. Numerous machine learning systems have been studied to appraise the security issues related to spam filtering. We presented a framework for the experimental evaluation of classifier security in adversarial environments that combines and builds on the arms race and security-by-design paradigms, adversary modelling, and the data distribution under attack. Furthermore, we presented SVM, LR, and MILR classifiers to categorize email as legitimate (ham) or spam on the basis of the text samples.
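As a rough illustration of the classification step (not the paper's actual SVM/LR/MILR implementation), the sketch below trains a plain logistic-regression spam classifier with stochastic gradient descent on toy bag-of-words counts; the features and data are invented for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=200):
    """Stochastic gradient descent for logistic regression (no regularization)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy bag-of-words features: [count("free"), count("meeting")]
X = [[3, 0], [2, 0], [0, 2], [0, 3]]
y = [1, 1, 0, 0]                              # 1 = spam, 0 = ham
w, b = train_logistic(X, y)
p_spam = sigmoid(sum(wj * xj for wj, xj in zip(w, [4, 0])) + b)
print("spam" if p_spam > 0.5 else "ham")      # → spam
```

The same weight vector then scores any new email's word counts, which is what the evaluation framework perturbs under attack.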
In procedural programs, logic follows procedures and instructions execute sequentially; in object-oriented programs (OOP), the basic unit is the object, which combines data and code. OOP programs encapsulate data within objects, which helps protect it, whereas procedural programs expose data. Encapsulation binds code and data together, inheritance lets one object acquire the properties of another, and polymorphism provides a general interface for the actions of a class. Initialization can occur only once, while assignment can occur many times. OOP organizes programs around objects and well-defined interfaces to data, with objects controlling access to their code.
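The contrast above can be made concrete with a short Python sketch; the `Account` classes are invented purely to illustrate encapsulation, inheritance, and polymorphism.

```python
class Account:
    """Encapsulation: the balance is kept internal and accessed via methods."""
    def __init__(self, owner, balance=0):
        self._balance = balance          # initialized once, here
        self.owner = owner

    def deposit(self, amount):
        self._balance += amount          # assignment may happen many times

    def balance(self):
        return self._balance

class SavingsAccount(Account):
    """Inheritance: acquires Account's data and code, then adds interest."""
    def add_interest(self, rate):
        self.deposit(self.balance() * rate)

def report(account):
    """Polymorphism: one interface works for any Account subclass."""
    return f"{account.owner}: {account.balance():.2f}"

a = SavingsAccount("alice", 100)
a.add_interest(0.05)
print(report(a))   # → alice: 105.00
```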
Architecture of a morphological malware detector (UltraUploader)
This document proposes an architecture for a morphological malware detector that combines syntactic and semantic analysis. It builds an efficient signature matching engine using tree automata techniques to represent control flow graphs (CFG). It also describes a graph rewriting engine to handle common malware mutations. The detector extracts CFGs from malware binaries to generate signatures, which are compiled into a minimal automaton database for efficient matching. Experiments showed promising results with a low false positive rate.
This document discusses using machine learning to classify malware into families based on the DREBIN dataset. It covers:
1. Preprocessing the dataset, including integer encoding and one-hot encoding to convert categorical data to numeric form for modeling.
2. Addressing overfitting by splitting the data into training and test sets and using cross-validation.
3. Using classifiers like Random Forest and SVM with strategies like one-vs-all and one-vs-one to perform multiclass classification of malware families.
4. The process of using binary classifiers for each family first, then combining the results to classify malware into the appropriate family.
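The encoding step in (1) can be sketched in plain Python; the family names below are illustrative stand-ins for DREBIN labels.

```python
def integer_encode(values):
    """Map each distinct category to a stable integer index."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Turn categories into binary indicator vectors for modeling."""
    codes, categories = integer_encode(values)
    return [[1 if i == code else 0 for i in range(len(categories))]
            for code in codes]

families = ["FakeInstaller", "DroidKungFu", "Plankton", "DroidKungFu"]
print(one_hot_encode(families))
# → [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

One-hot encoding avoids imposing a spurious ordering on the families, which the plain integer codes would.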
A FPGA-Based Deep Packet Inspection Engine for Network Intrusion Detection Sy... (Muhammad Nasiri)
This document summarizes a paper that proposes an FPGA-based deep packet inspection engine for network intrusion detection systems. The paper describes using FPGA for parallel processing of multiple signature patterns, including static strings and regular expressions. It presents architectures for handling one, correlated, and independent patterns. Simulation results show the proposed engine can process packets at line rate and maintain throughput even with 100% malicious traffic, unlike the software-based Snort detection engine. The goal is to speed up intrusion detection by offloading deep packet inspection to reconfigurable FPGA hardware.
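A software analogue of scanning for many signatures in a single pass (loosely mirroring what the FPGA does in parallel, and not the paper's hardware design) can be sketched with Python's `re` module; the signature set is invented.

```python
import re

# Hypothetical signature set: static strings plus one regular expression.
signatures = {
    "exploit_a": re.escape("GET /admin.php?cmd="),
    "exploit_b": re.escape("\x90\x90\x90\x90"),
    "dir_traversal": r"(\.\./){3,}",
}
# Combine into a single alternation so one scan checks all patterns,
# loosely analogous to the parallel matchers synthesized in hardware.
engine = re.compile("|".join(f"(?P<{name}>{p})"
                             for name, p in signatures.items()))

def inspect(payload):
    """Return the name of the first matching signature, or None."""
    m = engine.search(payload)
    return m.lastgroup if m else None

print(inspect("GET /admin.php?cmd=id"))   # → exploit_a
print(inspect("GET /index.html"))         # → None
```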
Packet Classification using Support Vector Machines with String Kernels (IJERA Editor)
Since the inception of the internet, many methods have been devised to keep untrusted and malicious packets away from a user's system. Traffic/packet classification can be used as an important tool to detect intrusion in the system. Using machine learning as an efficient statistics-based approach for classifying packets is a novel method in practice today. This paper emphasizes using an advanced string kernel method within a support vector machine to classify packets.
There exists a paper related to a similar problem using machine learning [2], but the research mentioned in that paper is not up to date and does not account for modern string kernels that are much more efficient. My work extends their research by introducing different approaches to classify encrypted/unencrypted traffic/packets.
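A minimal example of the idea, assuming a simple k-spectrum kernel rather than whatever advanced kernel the paper uses: two payloads score high when they share many length-k substrings, and the kernel value is what an SVM would consume.

```python
from collections import Counter
import math

def spectrum_features(s, k=3):
    """Count all length-k substrings (the k-spectrum of the string)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Inner product of k-mer counts; higher = more shared substrings."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    return sum(fa[g] * fb[g] for g in fa if g in fb)

def normalized(a, b, k=3):
    """Cosine-normalize so a payload scores 1.0 against itself."""
    return spectrum_kernel(a, b, k) / math.sqrt(
        spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))

p1 = "GET /index.html HTTP/1.1"
p2 = "GET /home.html HTTP/1.1"
p3 = "\x00\x04binary\x7fpayload"
print(normalized(p1, p2) > normalized(p1, p3))   # → True
```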
A Novel Classification via Clustering Method for Anomaly Based Network Intrus... (IDES Editor)
Intrusion detection in the internet is an active area of research. Intruders can be classified into two types: external intruders, who are unauthorized users of the computers they attack, and internal intruders, who have permission to access the system but with some restrictions. The aim of this paper is to present a methodology to recognize attacks during the normal activities in a system. A novel classification via the sequential information bottleneck (sIB) clustering algorithm is proposed to build an efficient anomaly-based network intrusion detection model. We have compared our proposed method with other clustering algorithms such as X-Means, Farthest First, Filtered clusters, DBSCAN, K-Means, and EM (Expectation-Maximization) clustering in order to assess the suitability of our proposed algorithm. A subset of the KDD Cup 1999 intrusion detection benchmark dataset has been used for the experiment. Results show that the proposed method is efficient in terms of detection accuracy and low false positive rate in comparison to the other existing methods.
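For contrast with the sIB approach, one of the baseline algorithms mentioned above, K-Means, is easy to sketch in plain Python on a single invented traffic feature (e.g., bytes per connection).

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on 1-D feature values."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # Recompute centers; keep the old center if a cluster emptied.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Normal connections around 100 bytes, an anomalous flood around 10000 bytes.
traffic = [98, 102, 95, 101, 99, 10050, 9900, 10100]
centers, clusters = kmeans(traffic, k=2)
print(sorted(round(c) for c in centers))
```

Points far from every center would then be treated as anomalies in a clustering-based detector.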
MotifGP is a motif discovery tool that uses multi-objective evolutionary algorithms to find DNA motifs. It evolves candidate motif solutions as network expressions using strongly typed genetic programming. MotifGP scores candidates based on how well they discriminate between input and control sequences while maintaining coverage of input sequences. When tested on 13 datasets, MotifGP identified known motifs and fully overlapped reference motifs in the databases, outperforming the existing DREME tool. Future work will explore different fitness functions and evolutionary methods to further improve motif discovery.
A Survey of String Matching Algorithms (IJERA Editor)
String matching algorithms play an important role in finding the places where one or several strings (patterns) occur within a larger body of text (e.g., a data stream, a sentence, a paragraph, a book). Their applications cover a wide range, including intrusion detection systems (IDS) in computer networks, bioinformatics, plagiarism detection, information security, pattern recognition, document matching, and text mining. In this paper we present a short survey of well-known, recently updated, and hybrid string matching algorithms. These algorithms can be divided into two major categories: exact string matching and approximate string matching. The classification criteria were selected to highlight important features of the matching strategies, in order to identify challenges and vulnerabilities.
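As a concrete instance of exact string matching, here is a small Boyer-Moore-Horspool implementation, one of the classical algorithms such a survey covers; the log line is an invented example.

```python
def horspool(text, pattern):
    """Boyer-Moore-Horspool exact matching: skip ahead using a shift table
    built from the pattern instead of checking every alignment."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # Shift for each pattern character (except the last position).
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Jump by the shift of the character aligned with the pattern's end.
        i += shift.get(text[i + m - 1], m)
    return -1

log = "alert: /etc/passwd read by uid=0"
print(horspool(log, "/etc/passwd"))   # → 7
```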
This document proposes an approach to fuzz testing software with no prior knowledge of the input format or binary code. It involves using static analysis metrics like cyclomatic complexity and loop detection to identify functions of interest. Data tainting is then used to track how user input propagates through the binary. In-memory fuzzing mutates input data at locations identified via tainting. This aims to limit human intervention and reduce false positives compared to existing in-memory fuzzing techniques. The approach combines static analysis, data tainting, and in-memory fuzzing in a new way to build an intelligent fuzzer requiring minimal instrumentation.
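The mutation step can be sketched as follows, assuming we mutate a copy of the input buffer rather than live process memory; `toy_parser` is an invented stand-in for a tainted function of interest.

```python
import random

def mutate(data, n_flips=4, seed=None):
    """Randomly XOR-flip a few bytes of an input buffer - the core mutation
    step of an in-memory fuzzer (applied here to a copy, not process memory)."""
    rng = random.Random(seed)
    buf = bytearray(data)
    for _ in range(n_flips):
        pos = rng.randrange(len(buf))
        buf[pos] ^= rng.randrange(1, 256)
    return bytes(buf)

def fuzz(target, seed_input, rounds=200):
    """Drive the target with mutated inputs and collect crashing cases."""
    crashes = []
    for i in range(rounds):
        case = mutate(seed_input, seed=i)
        try:
            target(case)
        except Exception:
            crashes.append(case)
    return crashes

def toy_parser(data):
    # Invented invariant standing in for a real parsing bug.
    if data[0] != 0:
        raise ValueError("unexpected header byte")

crashes = fuzz(toy_parser, b"\x00" * 16)
print(f"{len(crashes)} crashing inputs found")
```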
This document discusses using a semi-supervised learning approach with a transductive support vector machine (TSVM) model for sentiment analysis of blogs relating to organized crime in Mexico. It gathered data from a controversial blog about the drug trade, cleaned the text, labeled some instances, and trained a TSVM model. A 10-fold cross validation experiment showed the TSVM achieved higher accuracy than a standard SVM. While the TSVM was slower, its accuracy was promising, especially with fewer labeled instances. Future work could involve subjectivity classification before polarity and addressing errors in the current approach.
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING (IJNSA Journal)
This document proposes a method for fast detection of DDoS attacks using non-adaptive group testing (NAGT). It begins with background on DDoS attacks and group testing techniques. It then describes using a strongly explicit d-disjunct matrix in NAGT to map IP addresses to "tests" performed by routers. The router counters would indicate potential hot items (attackers or victims). Two decoding algorithms are presented to identify the hot items from the test results with poly-log time complexity meeting data stream requirements. The method aims to provide early warning of DDoS attacks through efficient group testing of IP packets.
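A toy decoder illustrates the idea, using a small constant-weight (hence 1-disjunct) matrix rather than the paper's strongly explicit d-disjunct construction; with at most one hot item, an item is reported only if every test containing it came back positive.

```python
def decode(tests, outcomes):
    """Naive non-adaptive group-testing decoder: an item survives only if
    every test (row) that includes it is positive. With a d-disjunct matrix
    and at most d hot items, the survivors are exactly the hot items."""
    n = len(tests[0])
    hot = []
    for item in range(n):
        if all(outcomes[t] for t in range(len(tests)) if tests[t][item]):
            hot.append(item)
    return hot

# 5 tests over 8 items; each column has weight 2 and all columns are
# distinct, so no column is contained in another (1-disjunct).
tests = [
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
hot_items = {2}                       # ground truth: item 2 is the attacker
outcomes = [any(row[i] for i in hot_items) for row in tests]
print(decode(tests, outcomes))        # → [2]
```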
Efficient Pseudo-Relevance Feedback Methods for Collaborative Filtering Recom... (Daniel Valcarce)
Slides of the presentation given at ECIR 2016 for the following paper:
Daniel Valcarce, Javier Parapar, Alvaro Barreiro: Efficient Pseudo-Relevance Feedback Methods for Collaborative Filtering Recommendation. ECIR 2016: 602-613
http://dx.doi.org/10.1007/978-3-319-30671-1_44
A semantics based approach to malware detection (UltraUploader)
This document proposes a semantics-based framework for malware detection that reasons about program behavior. It defines what it means for a detector to be sound and complete with respect to obfuscations. The framework uses trace semantics to characterize program behaviors and abstract interpretation to ignore irrelevant details. It shows this framework in action by proving a previous semantics-aware malware detector is complete against common obfuscations.
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER (ijnlc)
This document presents an adaptive log file parser that uses semantics and hidden Markov models. It first clusters log file lines based on semantics to limit unstructured text. It then builds a hidden Markov model to represent parsing patterns, with log entries as states and extracted values as emissions. When applied to a new system, it adapts the model's transition and emission probabilities to fit the new data. The approach achieves over 99.99% accuracy when trained on one system and applied to another with slightly different log patterns.
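The model-estimation step can be sketched by maximum-likelihood counting over (state, emission) pairs; the log lines and field-type states below are invented for illustration, not the paper's actual data.

```python
from collections import Counter, defaultdict

def estimate_hmm(sequences):
    """Count-based maximum-likelihood estimates of an HMM's transition and
    emission probabilities from sequences of (state, emission) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1
        for s, token in seq:
            emit[s][token] += 1
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return ({s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

# Hypothetical parsed log lines: states are field types, emissions are values.
logs = [
    [("TIMESTAMP", "12:01"), ("LEVEL", "INFO"), ("MSG", "started")],
    [("TIMESTAMP", "12:02"), ("LEVEL", "WARN"), ("MSG", "retrying")],
    [("TIMESTAMP", "12:03"), ("LEVEL", "INFO"), ("MSG", "done")],
]
trans, emit = estimate_hmm(logs)
print(trans["TIMESTAMP"])   # → {'LEVEL': 1.0}
```

Adapting to a new system then amounts to re-estimating (or smoothing) these probability tables on its logs.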
Machine learning can be applied in various areas of computer security like network security, endpoint protection, application security, user behavior analysis, and process behavior analysis. Some common machine learning techniques that are useful for security include regression for prediction and detection of anomalies, classification to identify threats and attacks, and clustering for forensic analysis and to detect outliers. Example applications of machine learning in security include using regression to detect anomalies in network traffic, classification to identify malware, and clustering to separate malware from legitimate files.
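As a minimal example of anomaly detection on network traffic (a simple z-score baseline with invented numbers, not any specific product's method):

```python
import statistics

def anomalies(values, threshold=2.5):
    """Flag points whose z-score exceeds the threshold - a simple baseline
    for spotting unusual traffic volumes."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [i for i, v in enumerate(values)
            if sigma and abs(v - mu) / sigma > threshold]

# Requests per minute; the spike at index 6 is a hypothetical attack burst.
traffic = [120, 118, 125, 119, 121, 117, 900, 122, 118]
print(anomalies(traffic))   # → [6]
```

Note that a large spike inflates the standard deviation and masks itself, which is why the threshold here is modest; robust statistics (median, MAD) handle this better in practice.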
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex... (CSCJournals)
Biclusters are required for analyzing expression patterns of genes by comparing rows in expression profiles, and for analyzing expression profiles of samples by comparing columns in the gene expression matrix. In the process of biclustering we need to cluster both genes and samples. The algorithm presented in this paper is based on the two-way clustering approach, in which genes and samples are clustered using parallel fuzzy C-means clustering with the message passing interface; we call it MFCM. MFCM is applied for clustering genes and samples so as to maximize the membership function values of the data set. It is a parallelized rework of a parallel fuzzy two-way clustering algorithm for microarray gene expression data [9], intended to study the efficiency and parallelization improvement of the algorithm. The algorithm uses a gene entropy measure to filter the clustered data and find biclusters. The method is able to obtain highly correlated biclusters of the gene expression dataset.
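A one-dimensional fuzzy C-means sketch shows the membership-based clustering at the heart of MFCM, without the parallel MPI machinery; the expression values are invented.

```python
def fuzzy_cmeans(points, m=2.0, iters=100):
    """Fuzzy C-means with two clusters in one dimension: each point gets a
    membership degree in every cluster rather than a hard assignment."""
    centers = [min(points), max(points)]      # crude two-cluster initialization
    c = 2
    u = [[0.0] * c for _ in points]
    for _ in range(iters):
        for i, x in enumerate(points):
            for j in range(c):
                d = abs(x - centers[j]) or 1e-12
                u[i][j] = 1.0 / sum((d / (abs(x - centers[k]) or 1e-12))
                                    ** (2.0 / (m - 1.0)) for k in range(c))
        # Centers are membership-weighted means (weights raised to m).
        centers = [sum((u[i][j] ** m) * x for i, x in enumerate(points)) /
                   sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, u

# Hypothetical 1-D expression values for six genes.
expr = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
centers, u = fuzzy_cmeans(expr)
print(sorted(round(cc, 3) for cc in centers))
```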
Applying Machine Learning to Software Clustering (butest)
This document discusses applying machine learning techniques to the problem of automatically clustering source code files into subsystems. Specifically, it formulates software clustering as a supervised machine learning problem, where a learner is trained on a subset of files that have been manually categorized and then aims to generalize that categorization to other files. The document tests two machine learning algorithms - Naive Bayes and Nearest Neighbor - on decompositions of three software systems, with the Nearest Neighbor algorithm achieving the best results.
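The Nearest Neighbor variant can be sketched with a Jaccard distance over per-file identifier sets; the token sets and subsystem labels are invented, not taken from the paper's systems.

```python
def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B| over the identifier sets of two source files."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def nearest_neighbor(train, features):
    """Assign a file to the subsystem of its closest categorized file."""
    return min(train, key=lambda t: jaccard_distance(t[0], features))[1]

# Hypothetical token sets extracted from already-categorized files.
train = [
    ({"socket", "bind", "listen", "accept"}, "network"),
    ({"fopen", "fread", "fclose", "buffer"}, "storage"),
    ({"connect", "socket", "send"}, "network"),
]
print(nearest_neighbor(train, {"socket", "send", "recv"}))   # → network
```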
From Free-text User Reviews to Product Recommendation using Paragraph Vectors... (Γιώργος Αλεξανδρίδης)
Presentation slides at the Workshop on e-Commerce and NLP (ECNLP 2019), held during The Web Conference 2019 (WWW19) at San Francisco, CA on May 14th, 2019
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...Hiroki Shimanaka
(1) The document presents an unsupervised method called Sent2Vec to learn sentence embeddings using compositional n-gram features. (2) Sent2Vec extends the continuous bag-of-words model to train sentence embeddings by composing word vectors with n-gram embeddings. (3) Experimental results show Sent2Vec outperforms other unsupervised models on most benchmark tasks, highlighting the robustness of the sentence embeddings produced.
This document summarizes research applying deep learning techniques to predict epigenomic enhancer regions from DNA sequences. Four models - variations of Basset, DeepSea, DanQ, and a custom Greenside-Basset model - were trained on a dataset from the NIH Roadmap Epigenomics Mapping Consortium to label 1000 base pair sequences as active or inactive for 57 cell types. The Basset variation achieved the best balanced accuracy of 72.8% and had the highest precision and F1 scores, though it still weighted negative examples heavily, as precision was not strong. Experimenting with different embeddings, loss functions, and architectures helped narrow the models best suited for this sequence labeling task.
This document describes the design and implementation of a mobile robot built with Lego bricks. It discusses the mechanical design including gears, frame, sensor placement, and weight distribution. It then covers the software architecture and various algorithms used for motion control, collision avoidance, mapping, path planning, vision, localization, and object manipulation. Key techniques include implementing a unicycle dynamic model for motion, using IR sensors and sonar for collision avoidance, representing the environment as a graph for path finding, and detecting objects and landmarks through color filtering and edge detection. The robot is able to autonomously navigate environments, identify rooms, locate objects, and return them to a base location.
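The unicycle model mentioned above reduces to a simple state update; this Euler-integration sketch is generic, not the project's actual controller.

```python
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """One Euler step of the unicycle model: v is forward speed,
    omega is the turn rate, theta the heading."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

# Drive straight for 1 s, turn in place 90 degrees, then drive straight again.
x, y, th = 0.0, 0.0, 0.0
for _ in range(10):
    x, y, th = unicycle_step(x, y, th, v=0.5, omega=0.0, dt=0.1)
for _ in range(10):
    x, y, th = unicycle_step(x, y, th, v=0.0, omega=math.pi / 2, dt=0.1)
for _ in range(10):
    x, y, th = unicycle_step(x, y, th, v=0.5, omega=0.0, dt=0.1)
print(round(x, 2), round(y, 2))   # → 0.5 0.5
```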
This document summarizes the accomplishments and future work of a project to create an autonomous robot system for assessing building lighting. The system created 2D and 3D maps, detected and located lights using a camera, identified light types with spectroscopy, and integrated these capabilities into a fully mobile prototype robot. Future work includes adding stereo cameras for improved light detection and position accuracy, an IMU for more accurate positioning, and miniaturizing the system to mount on a UAV. The project addressed designing requirements to map areas, locate and identify all light types, and create a fully mobile package.
PC-based mobile robot navigation system (ANKIT SURATI)
The document describes the design and components of a PC-based mobile robot for navigation. It uses a vision system with image processing and feature matching to navigate. A fuzzy logic controller computes the speed and angular speed needed by the two motors. An obstacle avoidance algorithm detects obstacles using pixel appearance differences. The robot is powered by a 12V battery and controlled through a software interface on a PC. Testing showed the fuzzy controller could achieve desired turns to navigate autonomously.
DESIGN AND IMPLEMENTATION OF PATH PLANNING ALGORITHM (NITISH K)
The document discusses the design and implementation of a path planning algorithm for a wheeled mobile robot in a known dynamic environment. It describes using an A* algorithm at a central control station to calculate the shortest path for the robot. If obstacles are detected, the robot's location and obstacle information is sent to update the environment map. The control station then recalculates the new shortest path for the robot. The system was tested experimentally and in simulation, showing it can effectively calculate the shortest path in a dynamic environment.
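The A* step can be sketched on a small occupancy grid with a Manhattan-distance heuristic; the grid is an invented example, not the paper's environment map.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid: f = g + h with a Manhattan-distance
    heuristic; returns the shortest path as a list of cells."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, [start])]
    best_g = {}
    while open_set:
        f, g, pos, path = heapq.heappop(open_set)
        if pos == goal:
            return path
        if best_g.get(pos, float("inf")) <= g:
            continue                     # already expanded with a better cost
        best_g[pos] = g
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) \
                    and grid[nr][nc] == 0:
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None

grid = [[0, 0, 0, 0],     # 0 = free, 1 = obstacle
        [1, 1, 1, 0],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
print(len(path) - 1)      # → 8 moves around the wall
```

On obstacle detection, the control station described above would mark the new cells as 1 and simply re-run the search.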
Implementation of D* Path Planning Algorithm with NXT LEGO Mindstorms Kit for... (idescitation)
Autonomous robots use various path planning algorithms to navigate to the target point. In a real-world situation, a robot may not have a complete picture of the obstacles in its environment. Classical path planning algorithms such as A* and D* are cost based: the shortest path to the target is calculated from the distance to be travelled. In order to provide real-time shortest path solutions, the cost computation has to be redone whenever new obstacles are identified. D* is a potential search algorithm, capable of planning the shortest path in unknown, partially known, and changing environments. This paper brings out the simulation of the D* algorithm in C++ and the results for different test cases. It also elucidates the implementation of the algorithm with the NXT LEGO Mindstorms kit using the RobotC language and its evaluation in a real-time scenario.
The document summarizes the Advanced Robotic Mapper (ARM) project. The ARM is a robot that can build a map of its environment. It includes components for position and orientation tracking, mapping sensors, a main controller and LCD, and software for mapping, navigation and user interfacing. Initial tests showed the robot could accurately map a surface 4+ feet away and travel a square path, but position tracking degraded over time. The compass was unstable when motors operated due to magnetic interference.
The document presents a complete Android-based framework for automatically identifying a user's transportation mode using GPS trajectories and accelerometer measurements from a smartphone. The framework includes an architecture, design, implementation, user interface, and algorithms for transportation mode identification. It applies segmentation, simplification, and machine learning classification techniques to collected GPS and accelerometer data to identify modes like walking, running, and in-vehicle transportation. The system was evaluated on real and simulated data, achieving an overall accuracy of around 85% for identifying transportation modes, outperforming the Google Activity Recognition API.
This document describes the design and implementation of a mobile robot built with Lego bricks. It discusses the mechanical design including gears, frame, sensor placement, and weight distribution. It then covers the software architecture and various algorithms used for motion control, collision avoidance, mapping, path planning, vision, localization, and object manipulation. Key techniques include implementing a unicycle dynamic model for motion, using IR sensors and sonar for collision avoidance, representing the environment as a graph for path finding, and detecting objects and landmarks through color filtering and edge detection. The robot is able to autonomously navigate environments, identify rooms, locate objects, and return them to a base location.
This document summarizes the accomplishments and future work of a project to create an autonomous robot system for assessing building lighting. The system created 2D and 3D maps, detected and located lights using a camera, identified light types with spectroscopy, and integrated these capabilities into a fully mobile prototype robot. Future work includes adding stereo cameras for improved light detection and position accuracy, an IMU for more accurate positioning, and miniaturizing the system to mount on a UAV. The project addressed designing requirements to map areas, locate and identify all light types, and create a fully mobile package.
PC-based mobile robot navigation sytemANKIT SURATI
The document describes the design and components of a PC-based mobile robot for navigation. It uses a vision system with image processing and feature matching to navigate. A fuzzy logic controller computes the speed and angular speed needed by the two motors. An obstacle avoidance algorithm detects obstacles using pixel appearance differences. The robot is powered by a 12V battery and controlled through a software interface on a PC. Testing showed the fuzzy controller could achieve desired turns to navigate autonomously.
DESIGN AND IMPLEMENTATION OF PATH PLANNING ALGORITHM NITISH K
The document discusses the design and implementation of a path planning algorithm for a wheeled mobile robot in a known dynamic environment. It describes using an A* algorithm at a central control station to calculate the shortest path for the robot. If obstacles are detected, the robot's location and obstacle information is sent to update the environment map. The control station then recalculates the new shortest path for the robot. The system was tested experimentally and in simulation, showing it can effectively calculate the shortest path in a dynamic environment.
Implementation of D* Path Planning Algorithm with NXT LEGO Mindstorms Kit for...idescitation
Autonomous Robots use various Path Planning
algorithms to navigate, to the target point. In the real world
situation robot may not have a complete picture of the obstacles
in its environment. The classical path planning algorithms
such as A*, D* are cost based where the shortest path to the
target is calculated based on the distance to be travelled. In
order to provide real time shortest path solutions, cost
computation has to be redone whenever new obstacles are
identified. D* is a potential search algorithm, capable of
planning shortest path in unknown, partially known and
changing environments. This paper brings out the simulation
of D* algorithm in C++ and the results for different test cases.
It also elucidates the implementation of the algorithm with
NXT LEGO Mindstorms kit using RobotC language and
evaluation in real time scenario.
The document summarizes the Advanced Robotic Mapper (ARM) project. The ARM is a robot that can build a map of its environment. It includes components for position and orientation tracking, mapping sensors, a main controller and LCD, and software for mapping, navigation and user interfacing. Initial tests showed the robot could accurately map a surface 4+ feet away and travel a square path, but position tracking degraded over time. The compass was unstable when motors operated due to magnetic interference.
The document presents a complete Android-based framework for automatically identifying a user's transportation mode using GPS trajectories and accelerometer measurements from a smartphone. The framework includes an architecture, design, implementation, user interface, and algorithms for transportation mode identification. It applies segmentation, simplification, and machine learning classification techniques to collected GPS and accelerometer data to identify modes like walking, running, and in-vehicle transportation. The system was evaluated on real and simulated data, achieving an overall accuracy of around 85% for identifying transportation modes, outperforming the Google Activity Recognition API.
This document describes a rescue robot project created by a team of 6 students guided by Mr. CH Rajesh Babu. It provides the team members' names and location as Rajahmundry. It then goes on to explain what a rescue robot is and provides a block diagram of its components. These include a robotic arm, metal detector, pulse width modulation (PWM), passive infrared (PIR) sensors, ultrasonic transducer, and accelerometer. It describes how these sensors work and the feasibility of controlling the robot via an Android mobile app or HTML page. Finally, it lists some potential applications for rescue robots, such as bomb dismantling, detecting earthquakes, acting as a fire extinguisher, and landmine detection
This document describes using fuzzy logic for robot navigation. Ultrasonic sensors are mounted on a robot to detect obstacles to the right, front, and left. Fuzzy logic is used to coordinate multiple reactive behaviors like obstacle avoidance, following edges, and moving toward a target. Simulation results show the strategy allows efficient navigation in complex environments. The robot can avoid obstacles, decelerate at turns, escape U-shapes, and reach targets using integrated ultrasonic sensors and fuzzy behavior control.
Reactive Navigation of Autonomous Mobile Robot Using Neuro-Fuzzy SystemWaqas Tariq
Neuro-fuzzy systems have been used for robot navigation applications because of their ability to exert human like expertise and to utilize acquired knowledge to develop autonomous navigation strategies. In this paper, neuro-fuzzy based system is proposed for reactive navigation of a mobile robot using behavior based control. The proposed algorithm uses discrete sampling based optimal training of neural network. With a view to ascertain the efficacy of proposed system; the proposed neuro-fuzzy system’s performance is compared to that of neural and fuzzy based approaches. Simulation results along with detailed behavior analysis show effectiveness of our algorithm in all kind of obstacle environments.
Path Planning for Mobile Robot Navigation Using Voronoi Diagram and Fast Marc...Waqas Tariq
For navigation in complex environments, a robot needs to reach a compromise between the need for having efficient and optimized trajectories and the need for reacting to unexpected events. This paper presents a new sensor-based Path Planner which results in a fast local or global motion planning able to incorporate the new obstacle information. In the first step the safest areas in the environment are extracted by means of a Voronoi Diagram. In the second step the Fast Marching Method is applied to the Voronoi extracted areas in order to obtain the path. The method combines map-based and sensor-based planning operations to provide a reliable motion plan, while it operates at the sensor frequency. The main characteristics are speed and reliability, since the map dimensions are reduced to an almost unidimensional map and this map represents the safest areas in the environment for moving the robot. In addition, the Voronoi Diagram can be calculated in open areas, and with all kind of shaped obstacles, which allows to apply the proposed planning method in complex environments where other methods of planning based on Voronoi do not work.
The document is a report on using fuzzy logic for robotic control. It discusses fuzzy sets and membership functions, fuzzy inference systems, and how fuzzy logic can be used for behaviors like obstacle avoidance, following edges, and target steering. The report provides examples of how fuzzy logic controllers allow incorporating human expertise to control systems without precise mathematical models. It also discusses applications of fuzzy logic for robot control that have been presented in other literature.
A rescue robot is a robot that has been designed for the purpose of rescuing people.Common situations that employ rescue robots are mining accidents, urban disasters, hostage situations, and explosions.Rescue robots in development are being made with abilities such as searching, reconnaissance and mapping, removing or shoring up rubble, delivery of supplies, medical treatment, and evacuation of casualties. Even with all these ideas coming about there are still some technical challenges that remain.
The Contents include...
What is a rescue robot ?
The deliverables required for a rescue robot
Needs & Ways Of Detection Of Humans
Procedures & Methods
Simple classification of robots
Types Of Rescue Robots etc...
An autonomous underwater vehicle (AUV) is a robot which travels underwater without requiring input from an operator. AUVs constitute part of a larger group of undersea systems known as unmanned underwater vehicles, a classification that includes non-autonomous remotely operated underwater vehicles (ROVs) – controlled and powered from the surface by an operator/pilot via an umbilical or using remote control. In military applications AUVs are more often referred to simply as unmanned undersea vehicles (UUVs).
Are current antivirus programs able to detect complex metamorphic malware an ...UltraUploader
This document presents research evaluating current antivirus programs' ability to detect complex metamorphic malware. The researchers designed a metamorphic engine to mutate malware code while preserving functionality. They applied this engine to the MyDoom worm and tested major antivirus products. Most products were unable to reliably detect the mutated MyDoom, demonstrating limitations in static detection techniques and the need for dynamic behavioral analysis to identify evolved malware variants.
GROUP FUZZY TOPSIS METHODOLOGY IN COMPUTER SECURITY SOFTWARE SELECTIONijfls
This document presents a group fuzzy TOPSIS methodology for selecting antivirus software. It discusses the risks of malware and importance of antivirus software. It then reviews fuzzy sets, fuzzy numbers, TOPSIS, and group fuzzy TOPSIS decision making approaches. The document applies these techniques to evaluate 7 popular antivirus alternatives based on 7 criteria from expert opinions. Sensitivity analysis is conducted on the results to provide insights for different user needs.
Truly dependable software systems should be built with structuring techniques able to decompose the software complexity without
hiding important hypotheses and assumptions such as those regarding
their target execution environment and the expected fault- and system
models. A judicious assessment of what can be made transparent and
what should be translucent is necessary. This paper discusses a practical
example of a structuring technique built with these principles in mind:
Reflective and refractive variables. We show that our technique offers
an acceptable degree of separation of the design concerns, with limited
code intrusion; at the same time, by construction, it separates but does
not hide the complexity required for managing fault-tolerance. In particular, our technique offers access to collected system-wide information
and the knowledge extracted from that information. This can be used
to devise architectures that minimize the hazard of a mismatch between
dependable software and the target execution environments.
OOP organizes a program into interacting objects. Classes and objects are core concepts - a class is a blueprint for creating objects with common properties and methods. An object has a state, behavior, and identity. Methods define reusable blocks of code that can be invoked on objects. Parameters allow methods to accept input data, while return values allow methods to provide output data.
Method-Level Code Clone Modification using Refactoring Techniques for Clone M...acijjournal
This document describes a method for modifying code clones using refactoring techniques. It discusses using the CloneManager tool to detect clones and then applying three refactoring patterns (extract method, move method, pull up method) to modify the clones. The approach is implemented as an enhancement to the CloneManager tool, allowing it to both detect clones and modify them using refactoring. The enhanced tool is tested on open source projects and results are compared to other clone detection tools.
Automatic reverse engineering of malware emulatorsUltraUploader
This document proposes techniques for automatically reverse engineering malware emulators. It presents an algorithm using dynamic analysis to execute emulated malware, record the x86 instruction trace, and use data flow and taint analysis to identify the bytecode program and extract syntactic and semantic information about the bytecode instruction set. The authors implemented a proof-of-concept system called Rotalumé, which accurately revealed the syntax and semantics of emulated instruction sets for programs obfuscated by VMProtect and Code Virtualizer.
Software Refactoring Under Uncertainty: A Robust Multi-Objective ApproachWiem Mkaouer
This document describes a multi-objective robust optimization approach for software refactoring that accounts for uncertainty in code smell severity levels and class importance. The approach formulates refactoring as a multi-objective problem to find solutions that maximize both quality, by correcting code smells, and robustness to changes in severity levels and importance. An evaluation on six open source projects found the approach generates refactoring solutions comparable in quality to existing approaches but with significantly better robustness across different scenarios.
This document discusses modeling and identifying spacecraft systems using adaptive neuro fuzzy inference systems (ANFIS). It presents ANFIS as a framework for controlling nonlinear multi-input multi-output systems with uncertainties. The document analyzes four cases of identifying a spacecraft system: deterministic models without and with noise, and ANFIS models without and with noise. It describes using ANFIS to represent a multi-input multi-output system as coupled input-output models. Experimental results demonstrate ANFIS's effectiveness in system identification.
Harnessing deep learning algorithms to predict software refactoringTELKOMNIKA JOURNAL
During software maintenance, software systems need to be modified by adding or modifying source code. These changes are required to fix errors or adopt new requirements raised by stakeholders or market place. Identifying thetargeted piece of code for refactoring purposes is considered a real challenge for software developers. The whole process of refactoring mainly relies on software developers’ skills and intuition. In this paper, a deep learning algorithm is used to develop a refactoring prediction model for highlighting the classes that require refactoring. More specifically, the gated recurrent unit algorithm is used with proposed pre-processing steps for refactoring predictionat the class level. The effectiveness of the proposed model is evaluated usinga very common dataset of 7 open source java projects. The experiments are conducted before and after balancing the dataset to investigate the influence of data sampling on the performance of the prediction model. The experimental analysis reveals a promising result in the field of code refactoring prediction
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesFACE
Metamorphic code includes self-modifying semantics-preserving transformations to exploit code diversification. The impact of metamorphism is growing in security and code protection technologies, both for preventing malicious host attacks, e.g., in soft- ware diversification for IP and integrity protection, and in malicious software attacks, e.g., in metamorphic malware self-modifying their own code in order to foil detection systems based on signature matching. In this paper we consider the problem of automatically extracting metamorphic signatures from metamorphic code. We introduce a semantics for self-modifying code, later called phase semantics, and prove its correctness by showing that it is an abstract interpretation of the standard trace semantics. Phase semantics precisely models the metamorphic code behavior by providing a set of traces of programs which correspond to the possible evolutions of the metamorphic code during execution. We show that metamorphic signatures can be automatically extracted by abstract interpretation of the phase semantics. In particular, we introduce the notion of regular metamorphism, where the invariants of the phase semantics can be modeled as finite state automata representing the code structure of all possible meta-morphic change of a metamorphic code, and we provide a static signature extraction algorithm for metamorphic code where metamorphic signatures are approximated in regular metamorphism.
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...kevig
This study investigates the effectiveness of Knowledge Named Entity Recognition in Online Judges (OJs). OJs are lacking in the classification of topics and limited to the IDs only. Therefore a lot of time is consumed in finding programming problems more specifically in knowledge entities.A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied for the recognition of knowledge named entities existing in the solution reports.For the test run, more than 2000 solution reports are crawled from the Online Judges and processed for the model output. The stability of the model is also assessed with the higher F1 value. The results obtained through the proposed BiLSTM-CRF model are more effectual (F1: 98.96%) and efficient in lead-time.
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...ijnlc
This study investigates the effectiveness of Knowledge Named Entity Recognition in Online Judges (OJs). OJs are lacking in the classification of topics and limited to the IDs only. Therefore a lot of time is consumed in finding programming problems more specifically in knowledge entities.A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied for the recognition of knowledge named entities existing in the solution reports.For the test run, more than 2000 solution reports are crawled from the Online Judges and processed for the model output. The stability of the model is
also assessed with the higher F1 value. The results obtained through the proposed BiLSTM-CRF model are more effectual (F1: 98.96%) and efficient in lead-time.
Discrete event systems comprise of discrete state spaces and eventNitish Nagar
This document discusses software for modeling and analyzing discrete event systems. It provides examples of software for control synthesis (TCT), verification (COSPAN), timed discrete event systems (KRONOS, UPPAAL, TTCT), and hybrid systems (HYTECH, SHIFT). It also lists examples and benchmarks for testing software, including examples of supervisory control.
A NOVEL APPROACH TO ERROR DETECTION AND CORRECTION OF C PROGRAMS USING MACHIN...IJCI JOURNAL
There has always been a struggle for programmers to identify the errors while executing a program- be it
syntactical or logical error. This struggle has led to a research in identification of syntactical and logical
errors. This paper makes an attempt to survey those research works which can be used to identify errors as
well as proposes a new model based on machine learning and data mining which can detect logical and
syntactical errors by correcting them or providing suggestions. The proposed work is based on use of
hashtags to identify each correct program uniquely and this in turn can be compared with the logically
incorrect program in order to identify errors.
The Last Line Effect. Abstract: Micro-clones are tiny duplicated pieces of code; they typically comprise only a few statements or lines. In this paper, we expose the “last line effect”, the phenomenon that the last line or statement in a micro-clone is much more likely to contain an error than the previous lines or statements. We do this by analyzing 208 open source projects and reporting on 202 faulty micro-clones.
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSISijseajournal
One major problem in maintaining a software system is to understand how many functional features in the
system and how these features are implemented. In this paper a novel approach for locating features in
code by semantic and dynamic analysis is proposed. The method process consists of three steps: The first
uses the execution traces as text corpus and the method calls involved in the traces as terms of document.
The second ranks the method calls in order to filter out omnipresent methods by setting a threshold. And
the third step treats feature-traces as first class entities and extracts identifiers from the rest method source
code and a trace-by-identifier matrix is generated. Then a semantic analysis model-LDA is applied on the
matrix to extract topics, which act as functional features. Through building several corresponding
matrices, the relations between features and code can be obtained for comprehending the system functional
intents. A case study is presented and the execution results of this approach can be used to guide future
research.
A hybrid model to detect malicious executablesUltraUploader
This document presents a hybrid model for detecting malicious executables that uses three types of features: binary n-grams extracted from executable files, assembly n-grams extracted from disassembled executables, and DLL function calls extracted from program headers. A classifier like SVM is trained on the combined "hybrid feature set" to distinguish between benign and malicious executables. The model achieves high detection accuracy and low false positive rates compared to other feature-based approaches.
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksEditor IJCATR
This document discusses using neural networks for software defect prediction. It examines the effectiveness of using a radial basis function neural network and a probabilistic neural network on prediction accuracy and defect prediction compared to other techniques. The key findings are that neural networks provide an acceptable level of accuracy for defect prediction but perform poorly at actual defect prediction. Probabilistic neural networks performed consistently better than other techniques across different datasets in terms of prediction accuracy and defect prediction ability. The document recommends using an ensemble of different software defect prediction models rather than relying on a single technique.
This document summarizes a degree project from KTH Royal Institute of Technology in Stockholm, Sweden from 2015. The project investigates threat modeling of historical cyber attacks using the Cyber Security Modeling Language (CySeMoL). Three documented cyber attacks are modeled to analyze the strengths and weaknesses of CySeMoL and propose improvements to the modeling language. The attacks modeled are Stuxnet, Diginotar, and an attack on Logica.
Similar to 414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357965939 (20)
University College London
Learning Rewrite Rules of
Metamorphic Obfuscation Engines
by
Iason Papapanagiotakis-Bousy
supervised by
Dr. Earl T. Barr
August 31, 2016
This report is submitted as part requirement for the MSc in Information Security at
University College London. It is substantially the result of my own work except
where explicitly indicated in the text. The report may be freely copied and
distributed provided the source is explicitly acknowledged.
Abstract
Metamorphic program obfuscations are semantics-preserving program transformations. In the space of malware they are used to avoid detection by static analysis. The perceived usage of metamorphic obfuscation is that a part of the malware rewrites itself before infecting a new host. This dissertation defines a different model, external metamorphic obfuscations, in which the obfuscation is done a priori. This process of mutating malware is analysed and compared to existing research. We then define the problem of learning the rewriting rules of such obfuscations given a finite set of its outputs. We prove that it is impossible to solve exactly and relax it to an optimization problem for which we give a solution under certain assumptions. Our work is the first step towards a new direction in metamorphic program transformation research that has spawned additional interesting questions for future work.
Keywords: Metamorphism, Malware, Term Rewriting Systems, Rule Learning, Obfuscation
1 Introduction
The expansion of connected computing systems makes them a lucrative target for criminals, who have evolved into organized professional groups or state-sponsored cyberwarfare units that produce malicious software (malware) to achieve their goals. This is apparent from everyday news reports of security breaches, electronic scams and fraud, but also from the evolution of the information security field, where professionals and academics work on detecting, classifying and mitigating such attacks.
Malware authors and security researchers are in a perpetual arms race, with the
former discovering new attack vectors and inventing increasingly complex techniques
to avoid detection and the latter working on protecting information systems from
such attacks.
One of the earliest techniques employed to detect known malware has been signatures. Anti-malware software keeps a list of the programs classified as malware by researchers and, before running a program, checks that it does not appear in that list. To counter this measure, malware authors have developed obfuscation techniques that change the “appearance” (syntactic representation) of a program. This creates a broad classification of malware with respect to the obfuscation used. Malware researchers have designated three obfuscation classes, given in increasing order of complexity: oligomorphic, polymorphic and metamorphic. For the evolution of obfuscations and their countermeasures we refer to the work of O’Kane et al. [33]. Our work focuses on the class of metamorphic obfuscations, described in Section 3; the other two have been widely studied but remain relevant, and polymorphic malware in particular still poses many open questions for researchers.
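The list-lookup detection described above can be sketched in a few lines of Python. This is an illustrative assumption on our part, not any real engine's implementation: signatures here are simply SHA-256 digests of whole files, and the function and set names are hypothetical.

```python
import hashlib

# Hypothetical signature list: SHA-256 digests of known-malicious files.
# The digest below is that of the empty byte string, used as a stand-in.
KNOWN_MALWARE_SHA256 = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def is_known_malware(data: bytes) -> bool:
    """Return True if the payload's digest appears in the signature list."""
    return hashlib.sha256(data).hexdigest() in KNOWN_MALWARE_SHA256
```

Any syntactic change to a program yields a different digest, which is exactly why the obfuscation techniques discussed next defeat this kind of naive matching.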
The term metamorphic obfuscation refers to the set of semantics-preserving code transformations that can be used to alter the syntax of a program. This code transformation is carried out by what is called the obfuscation engine. While the case commonly studied by researchers is for the obfuscation engine to be part of the malware, in this work we describe how this could be done differently. Specifically, we distinguish internal and external obfuscation engines and formally define the latter in Section 3.
We describe the advantages (and disadvantages) of using an external obfuscation engine, as well as some assumptions on how such engines might apply the obfuscations and their impact.
Central to our definition of external metamorphic obfuscation engines and the rest
of our work are term rewriting systems. Modeling an obfuscation engine as a term
rewriting system has been proposed as an elegant formalism that maps obfuscations
to rewriting rules. In Section 2.1, we give an introduction to term rewriting systems
and the notation used throughout the rest of this work.
Having defined metamorphic obfuscations using term rewriting systems, in Section 4 we tackle the following research questions: “Are metamorphic obfuscations learnable from a sample of obfuscated programs?” and, if so, “Under what assumptions can we learn the rules of a term rewriting system that approximate the obfuscation engine?” and “What classes of rewriting rules are learnable?”. To do so, we combine term rewriting systems theory, computational learning theory and association rule mining, introduced in Section 2.1, Section 2.3 and Section 2.4 respectively.
The main contributions of this work are:
1. We define external metamorphic obfuscation engines using term rewriting systems. We describe their strengths and weaknesses and hypothesise on how they might work internally.
2. We define the problem of learning the rewriting rules of an obfuscation engine given a finite sample of program variants generated from a single archetype program.
3. We prove the impossibility of the rewrite rule learning problem described above and relax it to an optimization problem with a trivial solution.
4. We then give an algorithm for solving the relaxed problem under some assumptions and argue that it is the first step towards a more general result.
1.1 A Motivating Example
In order to give some context of the learning problem we define in Section 4, we now
give a small example. Suppose a malware author wrote a program α in a programming
language with only the following instructions Σ = {ADD, MUL, NOP, JMP} (operands
are ignored for simplicity). Let α = NOP.ADD.MUL.JMP.ADD.JMP.MUL.
Now the author of α wants to obfuscate the program to make it look different. To do
so, the author uses the following two rules: R := {MUL → ADD.ADD.ADD, ε →
NOP}. By randomly applying some of these rules on parts of program α he ends
up with some variants of the original program. Let S denote the collection of those
variants and suppose
S = {NOP.ADD.NOP.MUL.JMP.ADD.JMP.MUL,
NOP.ADD.MUL.JMP.ADD.JMP.ADD.ADD.ADD,
NOP.ADD.MUL.JMP.ADD.NOP.JMP.MUL}
For a malware researcher, the problem is to classify all programs in S as different
forms of the same original program. In order to do so, it is very useful for the
researcher to know what obfuscating program transformations were used.
This brings us to our main research problem: given S, learn R.
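To make the example concrete, the following sketch shows how an engine could apply the two rules at random redexes to produce variants like those in S. The list encoding, the 50/50 rule choice, and the function name are our own illustrative assumptions, not part of the thesis.

```python
import random

# Illustrative sketch: programs are lists of instructions; the two rules
# are MUL -> ADD.ADD.ADD and the injector rule eps -> NOP.
# The 50/50 rule choice below is an assumption made for the example.
ALPHA = ["NOP", "ADD", "MUL", "JMP", "ADD", "JMP", "MUL"]

def apply_random_rule(program, rng):
    """Apply one randomly chosen rule at a randomly chosen redex."""
    if rng.random() < 0.5:
        i = rng.randrange(len(program) + 1)      # eps -> NOP: insert anywhere
        return program[:i] + ["NOP"] + program[i:]
    redexes = [i for i, ins in enumerate(program) if ins == "MUL"]
    if not redexes:
        return list(program)
    i = rng.choice(redexes)                      # MUL -> ADD.ADD.ADD
    return program[:i] + ["ADD", "ADD", "ADD"] + program[i + 1:]

rng = random.Random(0)
variant = apply_random_rule(ALPHA, rng)
```

Repeated applications of `apply_random_rule` yield a collection of variants such as S.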
2 Background
2.1 Term Rewriting Systems
Term rewriting systems (TRS) provide an elegant, abstract and simple, yet powerful,
computation mechanism. TRS are a fully general programming paradigm as they
can simulate Turing machines [24, 16]. The basic idea is very simple: replacement of
equals by equals, by applying symbolic equations over symbolically structured objects,
terms. Applying equations in one direction only immediately leads to the concept of
(directed) term rewriting. The most well-known TRS is probably the λ-calculus, which
has played a crucial role in mathematical logic with respect to formalising the notion
of computability [35].
In the rest of this section we give the basic notation and properties that are nec-
essary to our work in the following chapters. We made an effort to have a consistent
notation with the TRS literature but since there are conflicts for some parts we have
chosen what we believe makes most sense. The information presented here was taken
from the works of Franz Baader and Tobias Nipkow [7], Nachum Dershowitz and
Jean-Pierre Jouannaud [17], Yoshihito Toyama [38], Dana Ron [35] and Wikipedia
[46].
Let V ar(s) denote the set of variables occurring in a term s.
Definition 2.1. A rewrite rule is an identity l ≈ r such that l is not a variable
and V ar(l) ⊇ V ar(r). In this case we may write l → r instead of l ≈ r. The left-
hand side and the right-hand side of a rule ρ = (l, r) are given by lhs(ρ) and rhs(ρ)
respectively.
Definition 2.2. A term rewriting system (TRS for short) is a pair ⟨T , R⟩ con-
sisting of a set of terms T and a set R ⊆ T × T of (rewrite or reduction) rules.
A redex (reducible expression) is an instance of the lhs of a rewrite rule. Con-
tracting the redex means replacing it with the corresponding instance of the rhs of
the rule.
Definition 2.3. Let ⟨T , R⟩ be a TRS.
(1) → is the one-step rewrite relation. We write a → b iff there is a redex of a rule
r ∈ R in a and contracting it produces b, i.e. ∃r ∈ R s.t. a = x1.lhs(r).x2 ∧ b = x1.rhs(r).x2.
(2) →∗ is the transitive closure of → ∪ =, where = is the identity relation, i.e. →∗ is
the smallest preorder (reflexive and transitive relation) containing →. It is also
called the reflexive transitive closure of →.
(3) ↔ is → ∪ →−1, that is, the union of the relation → with its inverse relation, also
known as the symmetric closure of →.
(4) ↔∗ is the transitive closure of ↔ ∪ =, that is, ↔∗ is the smallest equivalence
relation containing →. It is also known as the reflexive transitive symmetric
closure of →.
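For the string rewriting systems used in our examples, the one-step relation can be enumerated directly. The sketch below is our own illustration, with terms encoded as dot-separated instruction strings; it yields every b with a → b.

```python
# Sketch of the one-step rewrite relation "a -> b" for string rewriting.
# Terms are dot-separated instruction strings; rules are (lhs, rhs) pairs,
# with "" standing for the empty word (an injector rule eps -> rhs).

def one_step_rewrites(term, rules):
    """Return every b with term -> b: find each redex x1.lhs.x2, contract it."""
    toks = term.split(".")
    results = set()
    for lhs, rhs in rules:
        l = lhs.split(".") if lhs else []
        r = rhs.split(".") if rhs else []
        for i in range(len(toks) - len(l) + 1):
            if toks[i:i + len(l)] == l:          # redex found at offset i
                results.add(".".join(toks[:i] + r + toks[i + len(l):]))
    return results

rules = [("MUL", "ADD.ADD.ADD")]
successors = one_step_rewrites("ADD.MUL.JMP.MUL", rules)
```

Here the term has two redexes of the single rule, so there are two one-step successors.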
Definition 2.4. A term x ∈ T is called reducible iff ∃y ∈ T such that x → y;
otherwise x is called irreducible or a normal form.
Definition 2.5. Two terms x, y ∈ T are joinable iff ∃z ∈ T such that x →∗ z ∗← y;
z may or may not be a normal form. If x and y are joinable we write x ↓ y.
Since multiple rewrite rules can have multiple redexes on a term, applying them
in a different order obviously leads to non-deterministic computations. This leads
to two fundamental questions on TRS:
• Do all computations eventually stop?
• If two sequences of rewritings, starting from the same term, diverge at one
point, can they eventually be rejoined?
The first question is the termination problem and the second one the confluence
problem.
Definition 2.6. A rewriting system ⟨T , R⟩ is terminating, also called noetherian,
iff for no t ∈ T is there an infinite sequence of rewritings starting from t. In such a
system every object has at least one normal form.
Theorem 2.1 (Term Rewriting and All That). The following problem is in general
undecidable:
Given a finite term rewriting system R = ⟨T , R⟩, is R terminating, i.e. is there no
term t ∈ T starting an infinite reduction sequence?
Although undecidable in the general case, the termination problem has decidable
subclasses that are of interest. A term rewriting system R = ⟨T , R⟩ such that
∀r ∈ R : V ar(rhs(r)) = ∅ is called right-ground. If for a right-ground TRS it also
holds that ∀r ∈ R : V ar(lhs(r)) = ∅, the TRS is a string rewriting system, usually
called a semi-Thue system. The termination problem is decidable for right-ground
TRS and semi-Thue systems [7].
Definition 2.7 (The Church-Rosser property). A rewriting system is confluent iff
∀x, y ∈ T : x ↔∗ y =⇒ x ↓ y.
Definition 2.8. A TRS R = ⟨T , R⟩ is locally confluent iff
∀x, y, z ∈ T : x → y ∧ x → z =⇒ y ↓ z.
Local confluence is important for showing confluence of a term rewriting system.
Huet [23] shows that a terminating TRS is confluent iff it is locally confluent. The
deciding factor for local confluence is the set of rules that form critical pairs. This
concept was discovered by Knuth and Bendix [28] in their paper that delivered
fundamental results on the confluence of term rewriting systems. Two rules r1, r2
form a critical pair if the prefix of one matches the suffix of the other or if one is a
subterm of the other.
Theorem 2.2. A terminating TRS is confluent iff all its critical pairs are joinable.
This result led to the Knuth-Bendix completion algorithm which tries to resolve
critical pair divergences by adding fitting rewrite rules.
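For string rewriting rules specifically, the overlap condition behind critical pairs can be checked mechanically. The sketch below is our own illustration over token lists; the helper names are not from the thesis.

```python
# Two string rewriting lhs's overlap (and so can produce a critical pair)
# when one occurs inside the other, or a proper suffix of one equals a
# prefix of the other. Token-list representation is illustrative.

def contains(big, small):
    """True if the token list `small` occurs contiguously inside `big`."""
    return any(big[i:i + len(small)] == small
               for i in range(len(big) - len(small) + 1))

def overlaps(l1, l2):
    """True if lhs token lists l1 and l2 overlap in either of the two ways."""
    if contains(l1, l2) or contains(l2, l1):
        return True
    return any(l1[-k:] == l2[:k] or l2[-k:] == l1[:k]
               for k in range(1, min(len(l1), len(l2))))
```

For example, `["A", "B"]` and `["B", "C"]` overlap on the shared `B`, while `["A", "B"]` and `["C", "D"]` do not.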
Definition 2.9. A term rewriting system that is terminating and confluent is called
convergent.
2.1.1 The Word Problem
In this section we give a description of the word problem, one of the most important
and most studied problems in term rewriting research. Here, we want to introduce
the problem to readers that are not familiar with it and, later on, in Section 4, we
argue its relevance to our research problem.
Definition 2.10 (The Word Problem). Given a term rewriting system ⟨T , R⟩ and
x, y ∈ T , are x and y equivalent under ↔∗R?
The problem is mostly encountered in abstract algebra where it is formulated as
“algorithmically determine if two representations of elements of a group are different
encodings of the same element”. The problem is in general undecidable and it has
been shown by William Boone that “There is exhibited a group given by a finite num-
ber of generators and a finite number of defining relations and having an unsolvable
word problem” [10].
Nevertheless, there has been plenty of research to prove decidability of the word
problem for specific types of groups. One of the most important results has been the
Knuth-Bendix completion algorithm, mentioned earlier, that can transform a termi-
nating term rewriting system into a convergent one [28]. This makes the word problem
decidable, since the resulting system is terminating and each term has a single normal
form. The limitations of the algorithm come from the fact that it is solving a problem
that is, in general, undecidable, which means that either it succeeds and outputs the
convergent TRS or it does not terminate. This “weakness” could be exploited in an
adversarial model by using techniques such as those proposed by Simonaire to render
the algorithm useless [36]. In practice, however, it is the best known term rewriting
system completion algorithm.
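To illustrate why convergence makes the word problem decidable, the sketch below normalises both terms and compares the unique normal forms. The single NOP-deleting rule (which is terminating and confluent) and the string encoding are our own illustration, not from the thesis.

```python
# In a convergent TRS every term has a unique normal form, so the word
# problem reduces to: normalise both terms and compare.

def normalise(term, rules, max_steps=10_000):
    """Repeatedly contract the leftmost redex until no rule applies."""
    toks = [t for t in term.split(".") if t]
    for _ in range(max_steps):
        changed = False
        for lhs, rhs in rules:
            l = lhs.split(".") if lhs else []
            r = rhs.split(".") if rhs else []
            for i in range(len(toks) - len(l) + 1):
                if toks[i:i + len(l)] == l:
                    toks = toks[:i] + r + toks[i + len(l):]
                    changed = True
                    break
            if changed:
                break
        if not changed:
            return ".".join(toks)
    raise RuntimeError("step budget exhausted; is the TRS terminating?")

def same_word(x, y, rules):
    """Decide x <->* y for a convergent TRS by comparing normal forms."""
    return normalise(x, rules) == normalise(y, rules)

rules = [("NOP", "")]   # delete NOPs: a terminating and confluent system
```

With these rules, `NOP.ADD.NOP.MUL` and `ADD.MUL` share the normal form `ADD.MUL` and are therefore equivalent.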
2.2 Finding Differences
As shown in the simple example given in the introduction, applying different obfus-
cation rules to different redexes will result (in most cases) in different strings. The
fact that the strings we work on are different gives us little to no information; we
are rather interested in how they differ. Finding the best (most compact) way of
expressing the difference of two strings amounts to computing the Levenshtein distance
together with the corresponding edit transcript of the two strings. This is also equivalent
to computing the longest common subsequence, which focuses on finding the common
parts, while the Levenshtein distance is about computing the differences.
Definition 2.11 (Levenshtein Distance). Given two strings s1 and s2 over an alpha-
bet Σ and the edit operations:
• Insert a symbol: ε → x such that uv gives uxv.
• Delete a symbol: x → ε such that uxv gives uv.
• Substitute a symbol: x → y, where x ≠ y, such that uxv gives uyv.
The Levenshtein distance of s1 and s2 is the minimum number of operations re-
quired to transform s1 into s2.
The Levenshtein distance belongs to the family of edit distances; other edit distance
metrics use a different set of edit operations or a different cost for each operation
(in the Levenshtein distance each operation has a cost of 1).
Definition 2.12. The global sequence alignment problem is to compute the
Levenshtein distance along with the transcript of necessary edit operations.
The problem was first solved by Needleman and Wunsch with dynamic program-
ming [32]. The complexity of the algorithm for two strings of length M and N is
O(MN) in time and O(MN) in space. Hirschberg invented a modification that re-
duces the space complexity to O(N) [21]. Hunt and McIlroy [25] proposed a heuristic
improvement with text files in mind. Their algorithm was implemented for the first
version of the Unix diff program. The diff program was later updated to use Myers'
algorithm [31], which has linear space complexity and achieves an expected-case
time complexity of O((M + N) + D²), where D is the length of the shortest edit
script.
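The quadratic dynamic-programming solution to the global alignment problem can be sketched as follows; it fills the O(MN) table and traces back an edit transcript (M = match, S = substitute, D = delete, I = insert). The function name and transcript letters are our own conventions.

```python
# Minimal sketch of global sequence alignment (Definition 2.12): compute
# the Levenshtein distance and recover one optimal edit transcript.

def align(s1, s2):
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete the first i symbols of s1
    for j in range(n + 1):
        d[0][j] = j                      # insert the first j symbols of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,   # match/substitute
                          d[i - 1][j] + 1,          # delete from s1
                          d[i][j - 1] + 1)          # insert into s1
    i, j, ops = m, n, []                 # trace back to recover the transcript
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            ops.append("M" if s1[i - 1] == s2[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return d[m][n], "".join(reversed(ops))

dist, transcript = align("kitten", "sitting")
```

The classic pair `kitten`/`sitting` has distance 3 (two substitutions and one insertion).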
2.3 Computational Learning Theory
In this section we give an introduction to the field of Computational Learning Theory.
This is not aimed to fully cover the research field but rather to introduce its basic
concepts to the reader and to motivate some of the problems we encounter in the
following sections.
From the Association for Computational Learning: Computational Learning The-
ory is a research field, part of Artificial Intelligence, that studies the design and anal-
ysis of Machine Learning algorithms. In particular, such algorithms aim at making
accurate predictions or representations based on observations.
The emphasis in Computational Learning Theory is on rigorous mathematical
analysis using techniques from various connected fields such as probability, statistics,
optimization, information theory and geometry. While theoretically rooted, learning
theory puts a strong emphasis on efficient computation as well [1]. Learning prob-
lems are modeled as an algorithm (the learner) that tries to learn a concept (i.e. a
language) given an information presentation method.
The field originated with the work of Mark Gold, who studied language learnability
[20]. In his paper, Gold defines language learning as an infinite process where at each
time unit, the learner receives a unit of information and is to make a guess as to the
identity of the unknown language on the basis of the information received so far. He
considers a class of languages learnable or identifiable in the limit with respect to
the information presentation method used if there is an algorithm such that: Given
any language of the class, there is some finite time after which the guesses will all be
correct. The information presentation methods studied were:
• Text, where the learner is presented with strings from the target language L in
random order.
• Informant, which can label strings as belonging to the target language L or not
and can choose a specific order in which to present information to the learner.
Gold found that under this model of learning the class of context-sensitive lan-
guages is learnable from an informant, but not even the class of regular languages
is learnable from text.
In 1984 Leslie Valiant proposed the Probably Approximately Correct (PAC)
model for learning [41]. PAC is considered amongst the most significant results
in computational learning theory as it provided an attractive general model to study
the computational, statistical and other aspects of learning [39]. Like in Gold’s work,
the PAC framework has a learner that receives samples and must form a generalisa-
tion function (called the hypothesis) that allows it to classify unseen instances.
The learner is either presented with samples following an arbitrary distribution or
has access to an informant to which he can make queries. Unlike identification in
the limit, PAC allows the hypothesis to have a bounded generalisation error (the
“approximately correct” part) with high probability (the “probably” part).
A more detailed overview of the computational learning theory field can be found
in the book of Kearns and Vazirani [27] and the papers of Angluin [6] and Turan [39].
As our main research question is the learning problem defined in section 4, we
explored Computational Learning Theory in order to have a formal framework in
which we can characterise our problem. The first step was to learn the formalism
used in this space and its main results. Once we had defined our learning
problem, we tried to map it to other learning problems that had already been studied
in order to leverage existing results to reach a conclusion about our research question.
2.4 Association Rule Mining
Our research problem defined in Section 4 is one of learning rules. More precisely,
we are interested in finding the smallest set of rules that can “describe” the largest
set of pairs of strings. This problem has been studied extensively in the field of data
mining giving birth to association rule mining.
Association rule mining, credited to Agrawal et al. [4], defines a set of measures
of interestingness of patterns found in large sets of transactions. It was originally
conceived to discover common patterns in shopping habits from data generated by
point-of-sale systems.
In our rule selection, we used the two best-known measures, support and confi-
dence. We should point out that in our use-case, all transactions contain exactly two
items. Thus, we give the definitions of support and confidence used for transactions
of two items.
Let T be a collection of transactions of exactly two items. We consider every
transaction to be a rule.
Support measures the coverage of a transaction t ∈ T, or how frequent it is:

    supp(t) = (# of occurrences of t) / |T|

Confidence measures the accuracy of a rule, or how frequently lhs(t) implies
rhs(t):

    conf(t) = (# of occurrences of t) / (# of occurrences of lhs(t))
Rule mining is usually done by manually defining a minimum support and then
looking for rules with high confidence.
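The two measures can be computed directly for our two-item transactions. The sketch below uses a made-up transaction collection for illustration; the function names are our own.

```python
# Support and confidence for two-item transactions, following the
# definitions above. Transactions are (lhs, rhs) pairs.

def support(t, transactions):
    """How frequent the transaction t is in the whole collection."""
    return transactions.count(t) / len(transactions)

def confidence(t, transactions):
    """How frequently t's lhs implies t's rhs."""
    lhs_count = sum(1 for (l, _) in transactions if l == t[0])
    return transactions.count(t) / lhs_count

# Made-up collection: three observed rewrites with lhs MUL, one with NOP.
T = [("MUL", "ADD.ADD.ADD"), ("MUL", "ADD.ADD.ADD"), ("MUL", "SHL"), ("NOP", "")]
```

For the rule MUL → ADD.ADD.ADD this gives support 2/4 = 0.5 and confidence 2/3.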
3 Offline Metamorphic Obfuscation
obfuscate • from the Latin ob- “in the way” and fuscus “dark brown”,
means to make obscure/confusing [2].
In this dissertation we are considering program obfuscations, that is obfuscations
that modify a program in order to make it “look” different but maintain its func-
tionality. The use cases of program obfuscation can vary, our work uses malware
obfuscation for examples and as a motivation but the following ideas might be ap-
plicable to a wider range of program obfuscation use cases.
Metamorphism is a class of obfuscations that aims to transform a program into
a different but equivalent one in order to avoid detection. As described by Szor
[37], “they are able to create new generations that look different”. A metamorphic
obfuscation engine can be modeled as a term rewriting system [44] that modifies the
syntax of a program p and outputs a syntactically different program p′ that maintains
the semantics of p. Let P be the set of all syntactically correct programs in some
instruction set (e.g. x86 assembly). Let e() : P → N be a function that returns the
computational resources (both space and time) necessary to run a program.
Definition 3.1 (Metamorphic Obfuscation Engine). A metamorphic obfuscation en-
gine is a tuple O = ⟨P, R, A⟩ where ⟨P, R⟩ is a term rewriting system over programs
and A is an algorithm for applying the rules in R.

p′ = O(p) such that:
    p ≃ p′ (p and p′ are semantically equivalent)     (1)
    p ≠ p′                                            (2)
    ∃n ∈ N : e(p′) ∈ O(e(p)^n)                        (3)

The first condition is needed in order for the metamorphic engine to not break
the functionality of the program. The second condition represents the syntactic
transformation of the input program p; the more different the output is from the
input, the better the obfuscation. Finally, the constraint on the time and space
complexity of the output program is a non-functional requirement: if p′ needed
exponentially more resources than p, it could not execute correctly in similar execution
environments.
The effort to formally define the space of malware, and in particular viruses,
started with Adleman [3] who gave a formal description of the different types of
computer viruses. In his work he identifies that their key functions are injure, infect
and imitate. At that time, Adleman did not consider program obfuscation. Zuo et
al. [48] updated those definitions to capture the effect of obfuscation that modern
malware uses. In their paper, Zuo et al. define metamorphic viruses as follows:
Let (d, p) denote the environment (data and programs). T(d, p) and I(d, p) are
called trigger and infection condition, respectively. When T(d, p) holds, the virus
executes the injury function D(d, p) and when I(d, p) holds, the virus uses S(p) to
select a program to infect.
Definition 3.2 (Metamorphic Virus [48]). The pair (v, v′) of two different total
recursive functions v and v′ is called a metamorphic virus if for all x, (v, v′) satisfies:

    φv(x) = D(d, p),               if T(d, p)
            φx(d, p[v′(S(p))]),    if I(d, p)
            φx(d, p),              otherwise     (4)

and

    φv′(x) = D′(d, p),             if T′(d, p)
             φx(d, p[v(S′(p))]),   if I′(d, p)
             φx(d, p),             otherwise     (5)
This can be extended to a tuple of n recursive functions (v1, v2, . . . vn) to capture
more complex metamorphic viruses.
In this definition, the obfuscation engine is assumed to be inside the malware.
At the time of infection the obfuscation engine will generate the new variant, we
call it an internal obfuscation engine as it is part of the malware. However, the
obfuscation engine could be separated from the malware and generate variants of the
original program at an earlier time; we will call such an obfuscation engine external.
This difference has been mentioned by Walenstein et al. and Okane et al. as open
and closed world obfuscations [45, 33]. Both terminologies capture the same notion,
but while internal/external focuses on the obfuscation process, open/closed focuses
on the set of known facts.
Informally, we say that internal malware obfuscation is when the obfuscation is
done by the malware at the time of infection of a new host. External obfuscation
is the case when the obfuscation of the malware is independent of the infection. In-
ternal obfuscation is the common case studied by researchers, possibly falling under
the “streetlight effect”, as it is easier to study obfuscation engines that can be recov-
ered by disassembly rather than hypothesise their existence in the hands of malware
authors.
In this work, we argue that there is value in defining the external model for
metamorphic obfuscations. Consider, for example, malware obfuscated by humans:
it has to be external obfuscation. This could be the case in spear phishing, where the
number of targets is very small and the “success” of the malware very important. In
such an instance, a human could rewrite the malware to obfuscate its function before
sending it to the victim.
Definition 3.3 (External Metamorphic Obfuscation). Let O denote an obfuscation
engine as described in definition 3.1 and p a program. The program p′ is externally
obfuscated metamorphically iff:

    p′ = O(p) ∧ O ∉ p′
The difference with definition 3.2 is, firstly, that while Zuo et al. use total recursive
functions to capture the obfuscations, we use term rewriting systems, like Walenstein
et al. [44], as we found them more appropriate. Secondly, the difference between
internal and external obfuscation is that unlike Zuo et al. and Walenstein et al.
[48, 44, 45], we separate the obfuscation from the infection.
Walenstein et al. made an analysis of the design space of metamorphic malware
that includes their obfuscation engine [45]. The paper explains why internal meta-
morphic obfuscation engines are difficult to design: their complexity is due to the
design space being a recursive one.
From the point of view of the author of a metamorphic obfuscation engine, there
are many benefits to choosing the external model over the internal one.
(1) Simplification. The design space is no longer necessarily recursive. The meta-
morphic engine does not have to locate the malware payload and disassemble
it since it can work directly on the source code. Additionally, the approximate
control flow graph can easily be constructed and non-normalisable obfuscations
can be applied.
(2) Stealth. Complex internal obfuscation engines like Win32/Simile tend to be
large segments of code making up to 90% of the size of the malware. Removing
it reduces significantly the footprint of the malware, making it less likely to be
detected.
(3) Durability. Metamorphic engines are very complicated software and valuable for
malware authors but despite all countermeasures, internal metamorphic engines
have been reverse engineered. Once the mutations are known to the research
community, the effectiveness of the engine is greatly reduced. With the exter-
nal model, malware authors are protecting the metamorphic engine from direct
disassembly.
The downside of having an external metamorphic engine would be that the mal-
ware is no longer self-contained when it comes to infecting a new target with a
modified copy of itself. One possibility would be for it to contact a remote server to
get a new instance when it is about to infect a new target. Alternatively, malware
authors may not be interested in the self-propagating property and instead leverage
the increasing number of attack vectors to infect new hosts (e.g. a web server under
the control of the malware author obfuscates a malicious JavaScript file before serving
it to its clients).
3.1 Application Strategies
Rewriting systems offer a very convenient way to model rewrite obfuscation engines
but do not provide an algorithm for changing one term into another. They are, in gen-
eral, as mentioned in Section 2.1, a non-deterministic computational model. However,
obfuscation engines are deterministic processes as they are real programs.
To go from the (possibly) non-deterministic computational model to a program,
an obfuscation engine needs an algorithm A to “enforce determinism”¹. The general
description of A is as follows:
Data: set of rewrite rules R; input string p; number of iterations maxiter
Result: a string p′ ≠ p such that p′ ↔ⁿR p for some n ∈ N with n ≤ maxiter
i = 0;
p′ = p;
redexes = list of all the redexes of rules in R found in p′;
while i ≤ maxiter ∧ redexes is not empty do
    r = select redex(redexes);
    p′ = contract(r);
    update redexes;
    i = i + 1
end
return p′
Algorithm 1: The template for a TRS rule application strategy
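The template above can be sketched in Python as follows. The dot-separated string encoding and helper names are our own assumptions; select redex is passed in as a parameter, matching the template, and here the redex list is recomputed on each iteration rather than incrementally updated.

```python
import random

# Sketch of Algorithm 1. Redexes are (lhs, rhs, offset) triples; "" as lhs
# encodes an injector rule eps -> rhs.

def find_redexes(toks, rules):
    out = []
    for lhs, rhs in rules:
        l = lhs.split(".") if lhs else []
        for i in range(len(toks) - len(l) + 1):
            if toks[i:i + len(l)] == l:
                out.append((lhs, rhs, i))
    return out

def apply_strategy(p, rules, maxiter, select_redex):
    """Contract up to maxiter redexes chosen by select_redex."""
    toks = p.split(".")
    for _ in range(maxiter):
        redexes = find_redexes(toks, rules)
        if not redexes:
            break
        lhs, rhs, i = select_redex(redexes)
        l = lhs.split(".") if lhs else []
        r = rhs.split(".") if rhs else []
        toks = toks[:i] + r + toks[i + len(l):]   # contract the chosen redex
    return ".".join(toks)

rules = [("MUL", "ADD.ADD.ADD"), ("", "NOP")]
rng = random.Random(1)
variant = apply_strategy("ADD.MUL.JMP", rules, 3, rng.choice)
```

Passing `rng.choice` as `select_redex` already gives the random strategy discussed in Section 3.1.2.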
The obfuscation engine O can then run the algorithm A to generate a variant of
a program p. Two factors remain to be determined: the select redex function and
the number of iterations of A. Selecting those two parameters defines what we call
the application strategy of the obfuscation engine. The number of iterations only
determines how many redexes will be contracted each time we execute O; a low
maxiter value will execute faster and could produce a higher number of variants²,
while a higher value will make each variant produced more different from its ancestor.

¹Here we use determinism in its computer science sense; the resulting algorithm is allowed to
use randomness.
²This is the case if the total number of equivalent programs is finite.
The select redex function, on the other hand, is more interesting. The author of the
obfuscation engine could write select redex to:
1. Always pick the first/last/i-th redex.
2. Pick a redex at random.
3. Have some other complex algorithm to decide what redex to choose.
In the next two paragraphs, we explore first the fixed strategy (option 1) and
why it is not fitting and then we consider options 2 and 3 in a single rule application
strategy called random rule — random offset.
3.1.1 Fixed Strategy
The first choice is what would happen by using simple regular expression libraries
(e.g. Python’s re module returns the first match by default). Although it is straight-
forward to implement, in practice it is a very bad choice for obfuscation. That is
because the obfuscation engine will constantly change the beginning of the program,
leaving most of the program unchanged. For example, consider the algorithm A that
always picks the first redex as described above. Let r ∈ R be an injector rule such
that r = ⟨ε, σ⟩. The redex of r in a program p will be just before the first symbol
of p; contracting it gives us εp → σp. Repeating the procedure n times would
result in σσ . . . σp (n copies of σ), making it vulnerable to signature detection. We
consider this strategy too simplistic and vulnerable to be used by malware authors.
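The weakness can be demonstrated in a few lines; the list encoding and function name are our own illustration.

```python
# Why the always-first-redex strategy is weak: with an injector rule
# eps -> NOP, every application lands just before the first symbol, so
# n rounds merely prepend a constant, signature-friendly prefix.

def first_redex_inject(p, n):
    for _ in range(n):
        p = ["NOP"] + p          # contract eps.p -> NOP.p at the first redex
    return p

result = first_redex_inject(["ADD", "JMP"], 3)
# -> ['NOP', 'NOP', 'NOP', 'ADD', 'JMP']
```

The original program survives verbatim after the predictable NOP prefix, which is exactly what a signature can match.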
3.1.2 Random Rule — Random Offset Strategy
The second and third possibilities could be considered as a single case study for mal-
ware researchers, as a complex enough redex selection algorithm (option 3) will look
like a random process (option 2) to an external observer. This has been previously
considered in the paper of Chouchane et al. [12] that models metamorphic obfusca-
tion engines as probabilistic language generators. In this case, we will call the rule
application algorithm A the random rule — random offset strategy. We believe this
is the right model to study as long as we do not have concrete evidence of what is
actually used.
Assumption 3.1. A implements the random rule — random offset strategy.
Under that assumption, we are interested in quantifying the impact that a rewrite
rule can have on a program.
Let O = ⟨P, R, A⟩ denote a metamorphic obfuscation engine with a finite set of
n rewriting rules ri ∈ R. Each ri is assigned a rule application probability Pi. Given
a program p, let Li be the set of all distinct redexes of the rule ri in p.
Definition 3.4. The relative impact of a rule ri ∈ R on a string p is:

    Iri = (|Li| · Pi) / (Σⁿj=1 |Lj| · Pj)
With relative impact, we can compare the effect of two rules of a given rewrite
system on a string p.
Definition 3.5. The absolute impact of a rule ri ∈ R on a string p is:

    Iai = |Li| / length(p)
Although absolute impact does not consider the rule application probabilities, it
can be useful when we need to compare between rules of different rewrite systems on
the same string p.
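Both measures can be computed directly. In this sketch (our own illustration) length(p) is taken as the number of instructions and redexes are counted over the dot-separated encoding used earlier.

```python
# Sketch of Definitions 3.4 and 3.5: relative and absolute impact of
# rules on a string p, with |L_i| counted as distinct redex offsets.

def count_redexes(toks, lhs):
    l = lhs.split(".") if lhs else []
    return sum(1 for i in range(len(toks) - len(l) + 1)
               if toks[i:i + len(l)] == l)

def impacts(p, rules_with_probs):
    """rules_with_probs: list of (lhs, P_i). Returns (relative, absolute)."""
    toks = p.split(".")
    weighted = [count_redexes(toks, lhs) * pi for lhs, pi in rules_with_probs]
    total = sum(weighted)
    relative = [w / total for w in weighted]
    absolute = [count_redexes(toks, lhs) / len(toks)
                for lhs, _ in rules_with_probs]
    return relative, absolute

rel, ab = impacts("MUL.ADD.MUL.JMP", [("MUL", 0.5), ("ADD", 0.5)])
```

With two MUL redexes and one ADD redex at equal probabilities, MUL carries twice the relative impact of ADD.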
3.2 Obfuscation Genealogy
While in Section 3.1 we analysed the possible rule application strategies that can
rewrite a program p into a different but equivalent program p′, in this section we
demonstrate the differences in the genealogy of the variants between internal and
external obfuscations.
Let mi be a malware that includes a metamorphic obfuscation engine. Assuming
it can keep state or it is using randomness, the obfuscation engine can generate
multiple different variants mi+1,1, mi+1,2, . . . , mi+1,n and can guarantee that all of
them are different from mi. We thus have the set Mi+1 = O(mi) such that
∀m ∈ Mi+1 : m ≠ mi.

Figure 1: All mutations of first degree are different.
Each of the programs in Mi+1 will then run the obfuscation to generate new
variants. Because the obfuscation is done internally, when the engine runs on
mi+1,1 it cannot know if some of the variants it will generate were already generated
by some other variant, e.g. mi+1,2. In general, let Mall denote all malware variants
generated; there is no guarantee that the sets O(m) for m ∈ Mall are pairwise
disjoint.
What was intended to be a tree of program variants is in fact a graph. This is
due to the limited knowledge that is available to the obfuscation engine because of
the fact that it is internal.
An external obfuscation engine O, on the other hand, has the advantage that it
“knows” all program variants generated and thus, can avoid duplicates. The process
of generating a set of obfuscated malware variants in the external setting could be
described by the following algorithm:
Figure 2: Different variants can generate the same new mutation.
Data: an external obfuscation engine O; the original program α; the set of all
malware variants generated Mall
Result: a bigger Mall set
Mall = {α};
while more variants are needed do
    select p from Mall;
    p′ = O(p);
    if p′ ∉ Mall then
        add p′ to Mall;
    end
end
return Mall;
Algorithm 2: Offline generation of obfuscated program variants
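Algorithm 2 translates directly to Python. The toy engine below (a single NOP-injecting rule), the stopping condition, and the `max_tries` safety budget are our own illustrative assumptions.

```python
import random

# Sketch of Algorithm 2: the external engine keeps the set of variants
# already generated, so duplicates are filtered out.

def generate_variants(obfuscate, alpha, wanted, rng, max_tries=10_000):
    m_all = {alpha}
    tries = 0
    while len(m_all) < wanted and tries < max_tries:
        p = rng.choice(sorted(m_all))   # select p from M_all
        p_prime = obfuscate(p, rng)
        if p_prime not in m_all:        # keep only unseen variants
            m_all.add(p_prime)
        tries += 1
    return m_all

def toy_obfuscate(p, rng):
    """Toy engine: insert a NOP at a random position (eps -> NOP)."""
    toks = p.split(".")
    i = rng.randrange(len(toks) + 1)
    return ".".join(toks[:i] + ["NOP"] + toks[i:])

variants = generate_variants(toy_obfuscate, "ADD.JMP", 5, random.Random(0))
```

Because membership in `m_all` is checked before adding, the resulting genealogy stays a tree of distinct variants, unlike the internal case.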
Although the resilience against repeated instances is a theoretical advantage of
the external obfuscation model, an interesting problem for future research would be
to bound the probability that an internal metamorphic obfuscation encounters such
a collision. Depending on the rewriting rules of the obfuscation TRS, this problem
can be more or less present: if the critical pairs of the TRS form cycles, then more
cycles will appear in the graph. Only a trivial, uninteresting set of rules could
completely avoid ever generating the same program variant.
A related problem is counting how many viable variants can be generated in
each model. Viability is an important concept when it comes to program mutations.
We have identified two concerns on that matter: First, is the mutation efficient
enough to be considered viable (e.g. a program that takes 2 hours to open a network
connection arguably is not)? Second, is the mutation doing what it is supposed to
do? In other words, is our TRS truly semantics-preserving, such that any sequence
of rewritings results in a semantically equivalent program?
Let A and B represent the viable program variants in the external and internal
model respectively. It is obvious that |A| ≥ |B| since we can simulate an internal ob-
fuscation using an external one (for the purpose of counting distinct viable variants),
but future research could focus on describing their relationship and the reasons that
might make B much smaller than A.
4 Learning Obfuscation Rules from Finite Malware Samples
As outlined in the introduction, one of the problems we set out to solve was to
(automatically) learn the rewrite rules of an obfuscation engine. Let Rall denote
all possible semantics-preserving code transformations; the problem is to learn the
subset of rewrite rules used by an obfuscation engine. We define the problem for
a finite sample of obfuscated programs, variants of the same original program, for
which we have no access to the obfuscation engine used to transform them.
This section starts with the preliminaries where we introduce some additional
concepts and definitions. Subsequently, the learning problem is formally defined and
studied. Finally, we give an algorithm for an approximate solution under assump-
tions.
4.1 Preliminaries
Like in definition 3.1, we will consider obfuscation engines as term rewriting systems.
An important prerequisite is a notion of “equality” between term rewriting systems.
We call two TRS R1 and R2 equivalent if they have the same set of terms T and for
each rewriting of a term under R1 there is a chain of one or more rewritings under
R2 that gives the same result.
Definition 4.1. A class of equivalent term rewriting systems E is defined as a set
of TRS such that ∀R1 = ⟨T , R1⟩, R2 = ⟨T , R2⟩ ∈ E and ∀x, y ∈ T :
x ↔∗R1 y iff x ↔∗R2 y.
Given this definition of equivalent term rewriting systems, it is important for the purposes of our work to give a relaxed definition of observable equivalence. This is motivated by the fact that we are interested in term rewriting systems that might be equivalent only for a (usually finite) subset of T.
Definition 4.2. A class of observably equivalent term rewriting systems Eo is defined as a set of TRS such that ∀R1 = ⟨T, R1⟩, R2 = ⟨T, R2⟩ ∈ Eo and ∀x, y ∈ T′ ⊆ T: x ∗↔R1 y iff x ∗↔R2 y.
Having defined what equivalent term rewriting systems are, an interesting problem is to measure the quality of a TRS R within a class E of equivalent term rewriting systems. To the best of our knowledge, this problem has not received much attention. In the only work we found that touches on this characterisation of term rewriting systems [40], the authors assert that the length of a rewriting rule is negatively correlated with how "good" the rule is. Building on that intuition, we define lR to be the total size of a TRS R and then define the subset of "optimal" TRS in an equivalence class E. We use the Occam's razor principle to represent the quality of a TRS.
Definition 4.3. The size of a term rewriting system R = ⟨T, R⟩ is:

lR := Σ_{r∈R} len(r), where len(r) := |rhs(r)| + |lhs(r)|
Definition 4.4. Given a class of equivalent term rewriting systems E, a TRS R = ⟨T, R⟩ ∈ E is Occam's razor iff:

∀R′ ∈ E : lR ≤ lR′
For the same practical reasons that we defined observably equivalent classes of term rewriting systems in Definition 4.2, we give a relaxed definition for the best TRS in a class of observably equivalent term rewriting systems.
Definition 4.5. Given a class of observably equivalent term rewriting systems Eo, a TRS R = ⟨T, R⟩ ∈ Eo is observably Occam's razor iff:

∀R′ ∈ Eo : lR ≤ lR′
In practice, instead of talking about an Occam's razor term rewriting system in an equivalence class E or an observably Occam's razor term rewriting system in an observably equivalent class Eo, we may simply use the terms Occam's razor TRS and observably Occam's razor TRS when the equivalence class is deducible from the context.
4.2 Problem Definition
We will now formulate our learning problem for metamorphic obfuscations. As mentioned at the beginning of the section, the fact that we want to capture both internal and external obfuscations limits us to an offline learning problem following the definition of Karp [26]. That is because if the obfuscation is done externally and we have no access to the obfuscation engine, the only thing we can learn from is a finite sample set. Ideally, given a finite number of malware variants generated from a single original program, the "archetype", we would like to infer the Occam's razor TRS that can rewrite any possible output of the metamorphic engine generated from the same archetype into any other.
Let P denote the set of all syntactically correct programs in some instruction set, O = ⟨P, R, A⟩ denote a metamorphic obfuscation engine, and α be the archetype program. Let Pα ⊆ P be the set of all possible outputs of O on input α, and let Sα ⊆ Pα be the finite sample set of observable outputs.
[Figure: the nested program sets Sα ⊆ Pα ⊆ P, together with the archetype α and a subset P′.]
Figure 3: The sets of programs.
By construction, we know that ∀pi, pj ∈ Pα there is a finite sequence of rewritings such that O^n(pi) = pj =⇒ pi n↔R pj, where O^n(p) denotes n applications of the algorithm A on input p given the rules in R.
We now state the problem formally.
Problem 4.1 (Learning a Metamorphic Obfuscation Engine). Given Sα, learn a term rewriting system R such that:

(i) R = ⟨Pα, R⟩ is Occam's razor.

(ii) pi ∗↔R pj ∀pi, pj ∈ Pα
Theorem 4.1. Given a set S, learning an Occam's razor set R of rewriting rules is impossible.

Proof. Term rewriting systems are equivalent to Turing machines [24, 16]. Consider an algorithm A such that, on input S, it outputs the Occam's razor (smallest) term rewriting system. We could then use A to compute the Kolmogorov complexity of a program p as K = |A(p)|. Since the Kolmogorov complexity is uncomputable, no such A can exist.
Nevertheless, it is still interesting to find a suboptimal (in terms of size) term
rewriting system from the set of equivalent term rewriting systems E. Although
size is still important, as an exponentially bigger TRS than the original one would be
impractical, we relax condition (i) to a TRS R whose size is bounded by a polynomial
of the number of samples.
Problem 4.2. Given Sα, learn a term rewriting system R such that:

(i) lR = O(|Sα|^n), n ∈ N.

(ii) pi ∗↔R pj ∀pi, pj ∈ Pα
Theorem 4.2. Solving problem 4.2 is impossible.

Proof. Consider the term rewriting system R as a language generator. Learning to recognize Pα (all possible strings of the language) from Sα (a finite sample of positive examples), with no additional information, has been shown to be impossible by Gold [20].
The impossibility of problem 4.2 can be stated informally as: “It is impossible
to generalize (knowledge) only from randomly presented positive examples without
error”.
Since it was condition (ii) that "caused" the impossibility, we relax that condition. Instead of trying to learn a TRS that maintains the ∗↔R property on all objects of Pα, we aim to cover only a subset P′ of Pα.
Problem 4.3. Given Sα, learn a TRS R such that:

(i) lR = O(|Sα|^n), n ∈ N.

(ii) pi ∗↔R pj ∀pi, pj ∈ P′ ⊆ Pα
Note that in this case we do not put a restriction on the relation between P′ and Sα.
Naïve Solution By relaxing the requirement on the size of the rewriting system that we need to learn, this problem becomes much easier. The trivial solution in this case is storing all pairs of samples in Sα as rewriting rules:

R := {r : rhs(r) = pi ∧ lhs(r) = pj, ∀pi, pj ∈ Sα}
Although this solves Problem 4.3, the solution is far from ideal. The resulting term rewriting system will have two drawbacks:

• Big size: the size of the resulting TRS will be lR = |Sα| ∗ (|Sα| − 1) ∗ Σ_{i=1}^{|Sα|} |pi|.

• Not generalising: the resulting rules will allow the rewriting of all p ∈ Sα but not others.
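As a sketch of the counting involved (in Python, the implementation language used in Section 5; the sample strings are illustrative, not taken from the thesis):

```python
from itertools import permutations

def naive_trs(samples):
    """Naive solution: every ordered pair of distinct samples becomes a
    rewrite rule (lhs, rhs), i.e. lhs(r) = p_j and rhs(r) = p_i."""
    return {(p_j, p_i) for p_i, p_j in permutations(samples, 2)}

def trs_size(rules):
    """l_R from Definition 4.3: sum of |lhs(r)| + |rhs(r)| over all rules."""
    return sum(len(lhs) + len(rhs) for lhs, rhs in rules)

samples = {"ADD.MUL", "ADD.BUS", "NOP.ADD.MUL"}
rules = naive_trs(samples)
# |Sa| * (|Sa| - 1) rules, each storing two whole programs
assert len(rules) == len(samples) * (len(samples) - 1)
```

Every rule stores two entire programs, which is what makes both the size and the lack of generalisation unavoidable here.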
The problem now becomes one of optimisation: find "the smallest term rewriting system" that covers "the largest set P′". This is the topic of Section 4.3.
4.3 An Approximate Solution
In this section we address Problem 4.3 in more detail. Although the naive solution already given is poor, it illustrates the general direction we will follow.
In order to improve upon the naive solution, we could try to minimise the TRS
generated, make it more generalising or both. Instead of storing each pair of strings,
the following solution uses the Levenshtein distance to minimise the amount of in-
formation stored to represent the difference of two strings.
Solution: Compute the edit transcripts for all pairs of strings in Sα. The edit transcript, as described in Section 2.2, can be seen as a function that transforms one string into another. Let E(Sα) be the set of edit transcripts computed from all the pairs of strings in Sα, and let p denote any entire string (not a substring) to be rewritten by the TRS. We can generate a TRS R = ⟨T, R⟩ as follows:

R := {r : lhs(r) = pi ∧ rhs(r) = e(pi, pj), ∀pi, pj ∈ Sα, e ∈ E(Sα)}
The generated TRS will still have |Sα| ∗ (|Sα| − 1)/2 rules, the same number of rules as the naive solution, but the rules will be smaller as they represent only the differences between the two strings. Note that, given the resulting TRS and two strings p1, p2 from Sα, we cannot "know" which rule of R to apply to get p2 from p1; we just have the guarantee that there is a rule r such that p2 = r(p1). This fact does not invalidate the solution, as the goal is to learn the TRS, not the algorithm that was used to apply it.
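The pairwise edit transcripts can be computed with standard dynamic programming. The following is a minimal sketch (the actual implementation, described in Section 5, uses the diff utility instead); the tuple encoding of operations is our own illustrative convention:

```python
def edit_transcript(s, t):
    """Levenshtein alignment of s and t; returns a list of edit operations
    ('i', ch, pos) insert, ('d', ch, pos) delete, ('s', old, new, pos)
    substitute, in left-to-right order."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # match / substitute
    ops, i, j = [], m, n
    while i > 0 or j > 0:   # trace back to recover one optimal transcript
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] and s[i - 1] == t[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(('s', s[i - 1], t[j - 1], i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(('d', s[i - 1], i - 1))
            i -= 1
        else:
            ops.append(('i', t[j - 1], j - 1))
            j -= 1
    return list(reversed(ops))
```

Storing only these operations, instead of both whole strings, is what shrinks each rule down to the actual difference between the pair.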
The next improvement to reduce the size of the generated TRS is to consider each edit operation found in any of the edit transcripts separately. For example, let i(x, 4) denote the insertion of x at position 4³ and s(x, y, 2) the substitution of y at position 2 with x. Let two edit transcripts be ed1 = i(x, 4), s(x, y, 2) and ed2 = i(x, 1), i(a, 9), i(b, 10), s(x, y, 12). The first preprocessing step is to group similar consecutive edit operations; in this case, ed2 becomes i(x, 1), i(ab, 9), s(x, y, 12). Then, extracting the individual edit operations from ed1 and ed2 yields the set:

{i(x, 4), s(x, y, 2), i(x, 1), i(ab, 9), s(x, y, 12)}

As mentioned earlier, we are not interested in knowing which rule to apply where, as long as there is a sequence of rewritings maintaining the ∗↔ relation. We can thus discard the indexes of where to apply the edit operations, yielding the much smaller set {i(x, ∗), i(ab, ∗), s(x, y, ∗)}. This leads us to the solution illustrated by the following algorithm:
³The delete operation is handled in the same way as insert.
Data: the malware sample set Sα
Result: a set of rewrite rules R
R := ∅, the rules to be extracted
for all pairs pi, pj ∈ Sα do
    t := the edit transcript of (pi, pj)
    group consecutive edits in t
    remove the indexes of edits in t
    for all edit operations edop ∈ t do
        if edop ∉ R then
            add edop to R
        end
    end
end
return R
Algorithm 3: Converting edit transcripts to rewrite rules.
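Algorithm 3 can be transcribed almost directly into Python. This is a sketch: `transcript` stands for any routine returning edit operations encoded as ('i', text, pos), ('d', text, pos) or ('s', old, new, pos), a convention assumed here for illustration:

```python
from itertools import combinations

def group_consecutive(ops):
    """Merge runs of same-kind edits at adjacent positions,
    e.g. i(a, 9), i(b, 10) -> i(ab, 9)."""
    grouped = []
    for op in ops:
        if grouped and op[0] == grouped[-1][0]:
            prev = grouped[-1]
            if op[0] in ('i', 'd') and op[2] == prev[2] + len(prev[1]):
                grouped[-1] = (op[0], prev[1] + op[1], prev[2])
                continue
            if op[0] == 's' and op[3] == prev[3] + len(prev[1]):
                grouped[-1] = ('s', prev[1] + op[1], prev[2] + op[2], prev[3])
                continue
        grouped.append(op)
    return grouped

def extract_rules(samples, transcript):
    """Algorithm 3: edit transcripts -> deduplicated, index-free rules."""
    rules = set()
    for p_i, p_j in combinations(samples, 2):
        for op in group_consecutive(transcript(p_i, p_j)):
            rules.add(op[:-1])   # drop the position index
    return rules
```

Running `group_consecutive` on the ed2 example from the text merges i(a, 9), i(b, 10) into i(ab, 9) exactly as described.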
Note that while edit operations are equivalent to rewrite rules, they are written slightly differently. The three edit operations insert, delete and substitute are represented as rewrite rules as follows: i(a, ∗) ⇔ r : ε → a, d(a, ∗) ⇔ r : a → ε, and s(a, b, ∗) ⇔ r : a → b.
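The mapping is mechanical; a sketch, with '' playing the role of ε and index-free operations encoded as tuples:

```python
def to_rewrite_rule(op):
    """Index-free edit operation -> rewrite rule (lhs, rhs);
    the empty string '' plays the role of ε."""
    kind = op[0]
    if kind == 'i':
        return ('', op[1])        # i(a, *)  <=>  ε -> a
    if kind == 'd':
        return (op[1], '')        # d(a, *)  <=>  a -> ε
    return (op[1], op[2])         # s(a, b, *)  <=>  a -> b
```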
This algorithm will generate a smaller TRS than the previous solutions explored, but the rules in R will still be far from the rewriting system used to obfuscate the programs in Sα. The reasons for this are:

• The algorithm will keep the rule r1 : a → b separate from the rule r2 : b → a, although only one of them should be in the "optimal" TRS.

• While grouping consecutive edit operations makes the TRS smaller, it could be the case that instead of the rule r : ε → ab, the original TRS has two rules r1 : ε → a and r2 : ε → b⁴.

• A generated rule might be the result of multiple applications of different rules. For example, let Ro = (r1 : aab → ddc, r2 : dd → ef) be the rules of the TRS used by the metamorphic obfuscation engine. The proposed algorithm might return the rules in Ro but (most likely) will also return others, like r : aab → efc. The rule r is clearly a composition of r1 and r2 and should thus be removed from the generated TRS as redundant.

⁴This remark also applies to rules that are the product of substitution edit operations.
• There might be noise in the generated rewrite rules. The noise can be due to the previous two remarks, or due to "imperfect" results of the edit transcript⁵.
In order to address the first point, we "redirect" all rewriting rules so that |lhs(r)| < |rhs(r)|, and if both sides are of equal length we use the lexicographic order lhs(r) < rhs(r). We call the problem presented in the second remark contiguous rule applications. It is addressed by an iterative process, described in the following algorithm, that looks for appearances of rules as sub-rules of larger ones. The third remark is about what we call nested rule applications. This problem has not been addressed yet, but we discuss a strategy to solve it in Section 7. Finally, in order to overcome noise, we use association rule mining (the support and confidence metrics) to determine which rewriting rules are most likely part of the original term rewriting system.
Let redirect() be a function that takes a rewrite rule and redirects it according to the strategy described above. In the following algorithm we also use ⊆ between two strings to indicate the substring relation. Finally, along with each rewrite rule stored we also keep the number of its appearances.
⁵This is analysed in Section 5, where we use diff to extract the pairwise differences.
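The redirect() function itself is a one-liner once rules are represented as (lhs, rhs) string pairs; a sketch of the orientation rule just described:

```python
def redirect(rule):
    """Orient a rule so that the shorter side is the lhs, breaking
    length ties lexicographically, so that r1: a -> b and r2: b -> a
    collapse onto a single representative."""
    lhs, rhs = rule
    if (len(lhs), lhs) > (len(rhs), rhs):
        lhs, rhs = rhs, lhs
    return (lhs, rhs)
```

Under this orientation an injector rule always ends up as ε → rhs, never the reverse.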
Data: the malware sample set Sα,
      support threshold st,
      confidence threshold ct
Result: a set of rewrite rules R
Rmin, the rules extracted by association mining
for all pairs pi, pj ∈ Sα do
    t := the edit transcript of (pi, pj)
    group consecutive edits in t
    remove the indexes of edits in t
    for all edit operations edop ∈ t do
        add redirect(edop) to R
    end
end
Rmin := {r ∈ R : support(r) > st ∧ confidence(r) > ct}
for r1 ∈ Rmin do
    for r2 ∈ R ∧ r2 ∉ Rmin do
        if (lhs(r1) ⊆ lhs(r2) ∧ rhs(r1) ⊆ rhs(r2)) ∨
           (lhs(r1) ⊆ rhs(r2) ∧ rhs(r1) ⊆ lhs(r2)) then
            R := R \ {r2}
            remove the occurrence of r1 from r2, giving r3 = r2 − r1
            R := R ∪ {r3}
            increase the counter of occurrences of r1 by one
        end
    end
end
compute support and confidence of all rules in R
update Rmin if any rule not previously included has passed the support and
confidence thresholds
return Rmin, R
Algorithm 4: The rule learning algorithm.
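The distinctive step of Algorithm 4, stripping occurrences of high-confidence rules out of the remaining ones, can be sketched as follows. This is only the direct case lhs(r1) ⊆ lhs(r2) ∧ rhs(r1) ⊆ rhs(r2); rules are (lhs, rhs) string pairs and the support/confidence bookkeeping is elided:

```python
def reduce_rules(r_min, rules):
    """Strip one occurrence of each high-confidence rule in r_min out of
    every other rule (the contiguous-rule-application reduction of
    Algorithm 4, direct case only; the crosswise case and the occurrence
    counters are omitted in this sketch)."""
    reduced = set()
    for lhs, rhs in rules:
        if (lhs, rhs) not in r_min:
            for l1, r1 in r_min:
                if l1 in lhs and r1 in rhs:
                    # r3 = r2 - r1: remove one occurrence of r1 from r2
                    lhs = lhs.replace(l1, '', 1)
                    rhs = rhs.replace(r1, '', 1)
        reduced.add((lhs, rhs))
    return reduced
```

For instance, with the injector ε → NOP recognised in Rmin, a composite rule ε → NOPNOP is reduced back down to ε → NOP.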
Complexity: To analyse the time complexity of the algorithm, we break it down into two main parts. The first is computing the edit transcripts for all pairs of samples, and the second is processing the rewrite rules. For easier analysis, assume all samples have the same length l = |s| ∀s ∈ Sα. If we use the Needleman-Wunsch algorithm to get the pairwise edit transcripts, the first part has time complexity (|Sα| ∗ (|Sα| − 1)/2) ∗ l². The complexity of the second part depends on the size of the set of rules generated in the first step and on the number of rules effectively recognised using the support and confidence thresholds on the first pass. We can express it as |Rmin| ∗ (|R| − |Rmin|). Putting both parts together, the proposed rule learning algorithm has complexity (|Sα| ∗ (|Sα| − 1)/2) ∗ l² + |Rmin| ∗ (|R| − |Rmin|). Because |Rmin| will be (much) smaller than |R|, we can simplify the expression to O((|Sα| ∗ l)² ∗ |R|).
Assumptions: We now discuss some assumptions under which the algorithm presented above gives good results.
By grouping consecutive edits into a single rewrite rule, we are implicitly making the following assumption. Let Ro = ⟨T, R⟩ be the term rewriting system used by the metamorphic obfuscation engine.

Assumption 4.1. The TRS Ro = ⟨T, R⟩ is such that T := Σ∗.

Corollary 4.2.1. Ro is effectively a semi-Thue system, as the terms in T contain only ground terms (strings).
This is a strong assumption: in the case of obfuscation engines, only very simple rewriting rules, such as NOP injectors, belong to the category of string rewriting. It is, nevertheless, a useful assumption to make, as it allows us to study a simpler yet non-trivial problem: learning a semi-Thue system.

• Solving that problem is a step towards the more general result of learning term rewriting systems. We give some ideas on how to generalise our result in Section 7.
• It can be used to identify the simplest obfuscation rewriting rules. The more occurrences of the lhs of a rule r ∈ R inside a program pi, the higher the relative impact of the rule r. Injector rules of the form ε → rhs can, by definition, be applied anywhere and thus have the highest relative impact. Removing simple rules from the toolset of malware authors is easier than learning more complex ones, and at the same time it greatly reduces the size of Pα.

• With a large enough sample Sα, our algorithm should be able to learn simple rules that include free variables. This will be the case when variables have a small domain (e.g. variables only take as values the names of registers r1–r16).
Assumption 4.2. There is a significant part of the code that remains the same across most variants in Sα.

While this is a strong assumption, it is reasonable to make. If variants in Sα do not share any code fragments, the best solution is the naive one. In practice, however, code mutations will not change a program so radically. It is left for future work to determine the percentage of unchanged code necessary for high quality results.
Let Ao be the algorithm, also referred to as the rule application strategy, used by the metamorphic obfuscation engine to apply the rules of Ro to a program p.

Assumption 4.3. Ao implements the random-rule, random-offset strategy.

The strategy was explained in Section 3.1.2 and can be summarised as "pick uniformly at random a redex from the set of redexes of all rewrite rules and contract it". Under this assumption, nested rule applications will be unlikely as long as the set of all redexes is large enough. The assumption is also an important factor when calculating the relative impact of each rule: if the selection is not random, the impact factor loses its importance.
5 Implementation
We provide an implementation of both an obfuscation engine on strings and the final algorithm proposed in Section 4.3. We chose Python 2.7 as our implementation language because it provides libraries that implement parts of our algorithms, and its high-level syntax allowed us to quickly incorporate changes to the algorithm into our code. All of our code has been developed and tested under a 64-bit version of Ubuntu Linux.
Building an obfuscation engine was not the goal of this work, but we created one in order to be able to test our rule learning algorithm. It is thus a very simple rewriting engine that is easy to modify or extend. The Python script will:

1. Generate a string a at random from a predefined alphabet and add a to Sα.
2. Pick at random a string p from Sα, a rewrite rule r and an offset ofs.
3. Contract r at the first redex of lhs(r) in p after ofs. If there is no such redex, do an injection (an always-applicable rule of the form ε → σ) at the offset.
4. If the resulting string is not already in Sα, add it to Sα.
5. While the size of Sα is smaller than a threshold, go to 2.
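The five steps above can be sketched as follows. This is a simplified re-implementation rather than the thesis code: programs are modelled as tuples of instructions instead of files with one instruction per line, and the rule set anticipates the one used for testing in Section 5.1:

```python
import random

ALPHABET = ["ADD", "MUL", "JMP", "JNE", "JEQ", "SUB", "BUS", "NOP", "XXX"]
RULES = [([], ["NOP"]),                     # r1: ε -> NOP (the injector)
         (["MUL"], ["ADD", "ADD", "ADD"]),  # r2
         (["ADD"], ["BUS"]),                # r3
         (["JMP"], ["ADD", "SUB", "JEQ"]),  # r4
         (["JNE"], ["SUB", "JEQ"])]         # r5

def find_redex(prog, lhs, ofs):
    """Index of the first occurrence of lhs in prog at or after ofs."""
    for i in range(ofs, len(prog) - len(lhs) + 1):
        if prog[i:i + len(lhs)] == lhs:
            return i
    return -1

def generate_samples(size, length=20, seed=0):
    """Random-rule, random-offset rewriting until `size` distinct
    variants of a random archetype exist."""
    rng = random.Random(seed)
    archetype = tuple(rng.choice(ALPHABET) for _ in range(length))
    samples = {archetype}
    while len(samples) < size:
        p = list(rng.choice(sorted(samples)))
        lhs, rhs = rng.choice(RULES)
        ofs = rng.randrange(len(p) + 1)
        pos = find_redex(p, lhs, ofs)
        if pos < 0:                    # no redex after ofs: inject instead
            lhs, rhs, pos = [], ["NOP"], ofs
        samples.add(tuple(p[:pos] + rhs + p[pos + len(lhs):]))
    return archetype, samples
```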
The second Python script is an implementation of the rule learning algorithm described in Section 4.3. The code closely follows the description given in the previous chapter. To get the unified edit operations from all pairs of input files, we tried both sequence alignment libraries from the computational biology field and the diff utility. Both use the same underlying algorithm, but diff has additional heuristics that make it a lot faster, as well as an output format that is easier to use; for those reasons we chose it over sequence alignment. We should note that the implementation does not use the difflib Python package, as we ran into cases where it produced sub-optimal results. Instead, we used the standard Linux diff utility, writing a Python parser to import its results into our program.
Because the diff utility was designed to work on text files, it works with entire lines, unlike sequence alignment, which works character by character. In order to make diff effective, the creation of the initial string and the rewritings performed put a single element of our alphabet Σ on each line. This should not be considered a limitation because if, for example, we wanted to apply our algorithm to x86 assembly, which has instructions of variable length, we could put one byte on each line.
5.1 Testing
In this section we describe the parameters used for testing the implementation and the results we obtained.
Although our algorithm works on strings containing any character, in order to simulate a real-world use case and have rules that people familiar with code obfuscations recognise, the examples we present here use an extension of the pseudo-assembly introduced in our motivating example in Section 1.1. Let our instruction set be:

Σ := {ADD, MUL, JMP, JNE, JEQ, SUB, BUS, NOP, XXX}

The "special" element XXX is added when generating the first string, the archetype α, to represent parts of the code that do not match any rewriting rule, thus limiting the places where rules can be applied (see Assumption 4.2).
The rules of our obfuscation engine are:

R := { r1: ε → NOP,
       r2: MUL → ADD.ADD.ADD,
       r3: ADD → BUS,
       r4: JMP → ADD.SUB.JEQ,
       r5: JNE → SUB.JEQ }
The testing was conducted as follows: first, generate a random string α from elements of Σ. Then, use the rules in R with the random-rule random-offset strategy to generate the sample set Sα. Finally, run the implementation of the rule learning algorithm with given support and confidence thresholds.
We chose the values of the support threshold (0.005) and confidence threshold (0.75) experimentally. We noticed that a high confidence threshold is effective in distinguishing "good" rules, while only a very small support is needed (mostly to avoid noise). In Table 1 we present the support and confidence scores of each rule in R, plus the rule with the highest support and the rule with the highest confidence not in R.
Table 1: Support and confidence of the rules for different sizes of the sample set.

|Sα|    R                            Best from the rest
        rule  supp.   conf.         rule        supp.      conf.
40      r1    0.6458  0.9602        ε → JNE     0.0062     0.0092
        r2    0.0866  1             ε → SUB     0.0062     0.0092
        r3    0.0735  1
        r4    0.0155  1
        r5    0.1519  1
200     r1    0.7797  0.9832        ε → JMP     0.003      0.0037
        r2    0.0213  0.9482        JEQ → JNE   6.48E-006  1
        r3    0.1153  0.992
        r4    0.0453  1
        r5    0.0228  1
The results we collected from running this experiment multiple times show that, by looking at the support and confidence of the extracted rewrite rules, it is easy to distinguish the rules that are in R. Because our rule learning algorithm does not have a strategy to process nested rule applications, we occasionally get rewriting rules like the following: MUL → BUS.ADD.ADD, JMP → BUS.SUB.JEQ. In both of those cases, replacing BUS with ADD transforms the rule into one of the rewrite rules of R.
Adding noise: While the problem formulated in Section 4 requires all the samples given to the learning algorithm to be rewritings of a single original program, in a real-world scenario it is unlikely that we could obtain such a sample set for an external metamorphic obfuscation engine (it would be possible if the engine is internal, or if we somehow have access to it and can make it generate variants). It is thus interesting to see how our proposed algorithm performs when the sample set includes programs/strings that were not generated by rewriting the archetype α.

To do so, we generated a sample set of 100 variants as before and then added random strings (using the same alphabet Σ). In Table 2 we give the results obtained; we present the confidence and support of the rules in R, as well as the rules with the highest support and confidence not in R.
Table 2: The support and confidence of the rules when adding random strings to the sample set.

|Sα|    R                            Best from the rest
        rule  supp.   conf.         rule                    supp.      conf.
120     r1    0.1245  0.1721        JNE.JMP → MUL.ADD.MUL   5.31E-006  1
        r2    0.0139  0.3558        ε → MUL                 0.0389     0.0537
        r3    0.0144  0.2687        ε → JEQ                 0.0388     0.0536
        r4    0.0215  0.4439
        r5    0.0118  0.3307
140     r1    0.0785  0.1065        JMP.MUL → JNE.SUB.SUB   1.91E-005  1
        r2    0.0075  0.2148        ε → JEQ                 0.0425     0.0577
        r3    0.0076  0.1682        ε → MUL                 0.0418     0.0567
        r4    0.0114  0.2683
        r5    0.0084  0.2388
160     r1    0.0504  0.0687        JEQ.JEQ → JNE.SUB.MUL   0.0002     1
        r2    0.0039  0.124         JNE.ADD.JEQ → JMP.XXX   0.0002     1
        r3    0.0044  0.1009        ε → JEQ                 0.0439     0.0599
        r4    0.0073  0.1851
        r5    0.0035  0.1086
It is hard to establish a threshold for support and confidence that would work for any size of sample set and any percentage of noise in the sample set. Instead of a fixed threshold, it is easier to look for the rewrite rules whose support and confidence are significantly higher than the rest. It is left to future work to find a formula that determines those thresholds. As we increase the amount of random strings added to the input sample set, there is a point after which no combination of support and confidence thresholds can distinguish the rules of R. This is the case for the third row of Table 2, where |Sα| = 160.
Performance: The code was developed as a proof of concept for the algorithm and is not optimised for either memory or speed. It was instead designed to allow easy modification as the project evolved. The largest test we ran was with 500 files, each with 200 instructions of our pseudo-assembly. With this input, it took several hours to complete on a laptop with a CPU speed of 1300 MHz. Should the theory evolve to include a wider range of obfuscations, it would be interesting to develop an optimised version of the code.
6 Related Work
There have been many proposed methods for malware classification, but no definition of external metamorphic obfuscation has been given so far, and by extension no classification technique to capture malware obfuscated in that way. Our work defines a new problem and makes the first steps towards a possible solution.
Previous malware classification efforts have focused on oligomorphic, polymorphic and internal metamorphic obfuscations. Researchers have applied multiple techniques to this classification problem. The observation that "A compromised application cannot cause much harm unless it interacts with the underlying operating system" [43] led many researchers to propose solutions based on classifying programs by the sequence of system calls they invoke. Both static [43] and dynamic [19, 22] analysis approaches have been explored, using either a manual [14] or automated [47, 13] learning process.
Another direction followed by researchers looking to improve malware classification has been the comparison of control flow graphs (CFG). The process, proposed by Bonfante, Kaczmarek and Marion, involves extracting the CFG of a program and proving it is isomorphic to the CFG of a known malware [8, 9]. Vinod et al. introduced the idea of comparing control flow graphs using the longest common subsequence of basic blocks of code in the CFGs of two programs [42]. Finally, Mehra, Jain and Uppal combined CFG analysis with system calls to automatically select the best features for classification [30].
The detection and classification technique most relevant to our work is program normalisation. The goal of program normalisation is to reduce the signature space by undoing obfuscations in order to obtain a single (or a small set of) normal form. Christodorescu et al. proposed a malware normaliser for three common obfuscation rules and used it effectively as a pre-processor for commercial malware detectors [15]. Bruschi et al. extended the same idea to cover a wider range of obfuscations [11]. It is noteworthy that while this approach might improve the performance of malware classifiers, it also has theoretical limits, as shown by Owens [34], who proposes a way to construct non-normalisable functions for metamorphic malware.
Figure 4: Detecting malware variants using normalisation. Taken from [44].
The work most closely related to ours was carried out by Walenstein et al. [44]. They model metamorphic obfuscation engines as term rewriting systems and present an algorithm that, given the obfuscation rules as a TRS, can produce a normalising TRS, one that is convergent and equivalence preserving, for that particular set of obfuscations. In their work, obfuscation rules were extracted manually from the mutation engine that was part of the malware, as they were working with internal metamorphic malware. Such a rule extraction technique is expensive, error prone, and, in the case of externally obfuscated malware, impossible. We thus consider our work complementary to theirs, as together they could become a classification technique for external metamorphic malware. Given an oracle that could solve the problem defined in Section 4 and generate the term rewriting system used to obfuscate, we could then use the technique proposed by Walenstein et al. to turn it into a normaliser for all programs transformed by that obfuscation engine. While this would present a complete solution to the classification of malware generated by an external obfuscation engine, it would still have the problem that the Knuth-Bendix completion procedure, used by Walenstein et al., does not always terminate.
When it comes to the term rewriting system literature, to the best of our knowledge, there have not been any attempts to (approximately) learn the rewriting rules of an unknown TRS. Since term rewriting systems are equivalent to Turing machines, the closest work in this space appears in the computational learning theory field, such as Gold's identification in the limit [20] and Valiant's theory of the learnable [41], which study the learnability of different classes of languages/problems.
7 Conclusion and Future Work
The novel nature of this research led us to many interesting problems which the limited time for this thesis did not allow us to pursue. We have mentioned some of them throughout Section 3 and Section 4; in this section we expand on them a little more and suggest possible directions that could be followed.
Starting with the external metamorphic obfuscations introduced in Section 3, it would be interesting to carefully study the design space of such obfuscations, as was done for internal metamorphic obfuscations by Walenstein et al. [45]. As already mentioned, malware relying only on external obfuscation would lose the ability to self-mutate in order to infect new hosts. A viability analysis of such a model could be done from the point of view of a malware author. Relatedly, in Section 3.2 we highlighted the notion of viability of a mutated variant. A more formal characterisation of viability should be defined, and based on it and the properties of the obfuscating term rewriting system (such as the existence of critical pairs/cycles) we could search for lower and upper bounds on the number of distinct possible variants.
Besides possible future work on external metamorphic obfuscation, we have suggestions for improving our work on learning obfuscations as rewrite rules.
In Section 4.3 we distinguished what we called nested rule applications. These are regions where multiple rewriting rules have been applied over many phases of the obfuscation. Because of this, the extraction of the pairwise difference might yield a pair (r, l) such that r and l contain sub-terms of different rewrite rules (either in their lhs or rhs). The problem that could be studied is: given the pair (r, l) and a set R of known rewrite rules (rules inferred with high confidence), find the shortest path, if one exists, of rewrite rule applications from the set R between r and l.
An important limitation, already discussed in Section 4.3, is that our proposed solution infers string rewriting rules. While this can capture simple rules, in order to extract with high confidence more complex rules that are context sensitive and contain variables, a more powerful learning process has to take place; that is, to go from learning rules that match substrings to rules that match subterms. Our preliminary research on this topic gave us two distinct directions that could be explored in search of a way to learn term rewriting rules: regular expressions and anti-unification.
Learning regular expressions from a set of lhs of rewrite rules could help discover the invariant parts of a rule and abstract the rest. Different algorithms have been proposed for learning deterministic finite automata [5] or, more closely related, regular expressions from positive examples [18]. The drawback of regular expressions is that, unlike anti-unification, the formalism does not permit substitution variables.
Anti-unification is the process of constructing the least general generalisation common to two given symbolic expressions. Given two terms t1 and t2, anti-unification is concerned with finding a term t such that both t1 and t2 are instances of t under some substitutions [29]. With anti-unification, we get terms with distinct variables that we could not obtain with regular expression learning. On the other hand, in order for anti-unification to work, the initial terms must have some structure, as opposed to regular expression learning, which can be applied to any string. For our particular use case, this should not be a problem, as the assembly obtained by any disassembler has that structure (each instruction is a function symbol and each register is a variable). We thus believe that anti-unification is more suitable than regular expression learning and should be the way to learn term rewriting rules.
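For intuition, a minimal anti-unification sketch over terms encoded as nested tuples; the encoding and the variable naming '?0', '?1', ... are our own illustrative choices:

```python
def lgg(t1, t2, subst=None):
    """Least general generalisation (anti-unification) of two terms.
    Terms are nested tuples (f, arg1, ...) or atoms; each distinct
    mismatching pair of sub-terms maps to its own variable, and the
    same mismatch always reuses the same variable."""
    if subst is None:
        subst = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and len(t1) == len(t2) and t1[0] == t2[0]):
        # same function symbol and arity: recurse over the arguments
        return tuple([t1[0]] + [lgg(a, b, subst) for a, b in zip(t1[1:], t2[1:])])
    if (t1, t2) not in subst:
        subst[(t1, t2)] = '?%d' % len(subst)
    return subst[(t1, t2)]
```

Note how repeated mismatches yield the same variable, giving generalisations like ('ADD', '?0', '?0') that no regular expression over flat strings can express.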
The word problem described in Section 2.1.1 is undecidable in general but has
been shown to be decidable for certain groups. Malware classification using
normalisation tries to solve the word problem, since it tries to determine whether a
malware m1 can be reduced to the normal form n of a known malware m2. In other
words, given n and m2 such that n ↔∗ m2, test whether n ↔∗ m1, which is equivalent
to testing whether m1 ↔∗ m2. It would thus be interesting, from a theoretical
perspective, to see whether malware classification using program normalisation could
be reduced to a group where the word problem is decidable. If not, any solution to
this problem will always include heuristics.
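In the meantime, such a heuristic typically normalises both samples under the rule set oriented in one direction and compares normal forms; this correctly decides m1 ↔∗ m2 only when the oriented system is convergent. A minimal sketch over string rewriting rules, with toy rules of our own invention:

```python
def normalise(s, rules, max_steps=1000):
    """Rewrite s with oriented string rules (lhs -> rhs) until no rule
    applies. The result is a unique normal form only if the rule
    system is convergent (terminating and confluent); otherwise this
    is merely a heuristic."""
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)   # one leftmost application
                break
        else:
            return s                          # no rule applies: normal form
    raise RuntimeError("step budget exhausted; system may not terminate")

def same_family(m1, m2, rules):
    """Heuristic test for m1 <->* m2 via normal-form comparison."""
    return normalise(m1, rules) == normalise(m2, rules)

# Toy instruction-substitution rules, oriented toward a canonical form:
rules = [("push eax; pop ebx", "mov ebx, eax"),
         ("xor eax, eax", "mov eax, 0")]
same_family("xor eax, eax; push eax; pop ebx",
            "mov eax, 0; mov ebx, eax", rules)   # → True
```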
Another line of work that could be pursued is applying the proposed learning
method to variants of known internal metamorphic malware. This could be done with
the current algorithm and, if the improvements suggested above are implemented, a
comparison of the results would help validate the method. We suggest known
obfuscations because for them we have access to the ground truth: the rules that
have been found after disassembling the obfuscation engines.
Following up on the previous suggestion, future work should focus on defining the
obfuscation rewrite rule learning problem for internal obfuscations. With access to
the obfuscation engine, even as a black box, it is possible that the learning process
can be shown equivalent to the teacher-student concept used in PAC learning [27].
Finally, future research could try to apply our results, both technical and
theoretical, in other fields. Possible topics range from commercial obfuscation for
intellectual property protection to plagiarism detection.
8 Acknowledgments
First and foremost, I would like to express my gratitude to my supervisor, Dr.
Earl Barr. More than a supervisor, he has been a true mentor, pointing me in the
right directions to help my thinking progress. Rather than giving me answers, he gave
me advice and the motivation to work on the interesting problems we encountered.
For this I will always be grateful to him.
I would also like to thank Dr. David Clark for suggesting that I use association rule
mining. This simple rule learning mechanism turned out to be a perfect match for
what we needed in this work.
I am thankful to Dr. Hector Menendez Benito for giving me his opinion on parts of
my work as well as taking the time to review with me some of the background work.
For helping me with typesetting in LaTeX, I would like to thank Zheng Gao.
For her moral support and proofreading work, I am thankful to Vasiliki Meletaki.
Finally, for their support and encouragement to continue my studies I would like to
thank my parents.
References
[1] Association for Computational Learning. http://www.learningtheory.org/.
Accessed: 15-August-2016.
[2] obfuscate, Merriam–Webster Dictionary. http://www.merriam-webster.com/
dictionary/obfuscate. Accessed: 15-August-2016.
[3] L. M. Adleman. An abstract theory of computer viruses (invited talk). In
Proceedings on Advances in Cryptology, CRYPTO ’88, pages 354–374, New York,
NY, USA, 1990. Springer-Verlag New York, Inc.
[4] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules
between sets of items in large databases. In ACM SIGMOD Record, volume 22,
pages 207–216. ACM, 1993.
[5] Dana Angluin. Queries and concept learning. Machine learning, 2(4):319–342,
1988.
[6] Dana Angluin. Computational learning theory: survey and selected bibliogra-
phy. In Proceedings of the twenty-fourth annual ACM symposium on Theory of
computing, pages 351–369. ACM, 1992.
[7] Franz Baader and Tobias Nipkow. Term rewriting and all that. Cambridge
university press, 1999.
[8] Guillaume Bonfante, Matthieu Kaczmarek, and Jean-Yves Marion. Control
flow graphs as malware signatures. In International workshop on the Theory of
Computer Viruses, 2007.
[9] Guillaume Bonfante, Matthieu Kaczmarek, and Jean-Yves Marion. Morpholog-
ical detection of malware. In Malicious and Unwanted Software, 2008. MAL-
WARE 2008. 3rd International Conference on, pages 1–8. IEEE, 2008.
[10] William W Boone. The word problem. Annals of mathematics, pages 207–265,
1959.
[11] Danilo Bruschi, Lorenzo Martignoni, and Mattia Monga. Code normalization
for self-mutating malware. IEEE Security and Privacy, 5(2):46–54, 2007.
[12] Mohamed R Chouchane, Andrew Walenstein, and Arun Lakhotia. Statistical
signatures for fast filtering of instruction-substituting metamorphic malware.
In Proceedings of the 2007 ACM workshop on Recurring malcode, pages 31–37.
ACM, 2007.
[13] Mihai Christodorescu, Somesh Jha, and Christopher Kruegel. Mining specifica-
tions of malicious behavior. In Proceedings of the 1st India software engineering
conference, pages 5–14. ACM, 2008.
[14] Mihai Christodorescu, Somesh Jha, Sanjit A Seshia, Dawn Song, and Randal E
Bryant. Semantics-aware malware detection. In 2005 IEEE Symposium on
Security and Privacy (S&P’05), pages 32–46. IEEE, 2005.
[15] Mihai Christodorescu, Johannes Kinder, Somesh Jha, Stefan Katzenbeisser, and
Helmut Veith. Malware normalization. Technical report, University of Wiscon-
sin, 2005.
[16] Max Dauchet. Simulation of turing machines by a left-linear rewrite rule. In
International Conference on Rewriting Techniques and Applications, pages 109–
120. Springer, 1989.
[17] Nachum Dershowitz and Jean-Pierre Jouannaud. Rewrite systems. Citeseer,
1989.
[18] Henning Fernau. Algorithms for learning regular expressions from positive data.
Information and Computation, 207(4):521–541, 2009.
[19] Stephanie Forrest, Steven A Hofmeyr, Anil Somayaji, and Thomas A Longstaff.
A sense of self for unix processes. In Security and Privacy, 1996. Proceedings.,
1996 IEEE Symposium on, pages 120–128. IEEE, 1996.
[20] E Mark Gold. Language identification in the limit. Information and control,
10(5):447–474, 1967.
[21] Daniel S. Hirschberg. A linear space algorithm for computing maximal common
subsequences. Communications of the ACM, 18(6):341–343, 1975.
[22] Steven A Hofmeyr, Stephanie Forrest, and Anil Somayaji. Intrusion detection
using sequences of system calls. Journal of computer security, 6(3):151–180,
1998.
[23] Gérard Huet. Confluent reductions: abstract properties and applications to
term rewriting systems. Journal of the ACM (JACM), 27(4):797–821, 1980.
[24] Gérard Huet and Dallas Lankford. On the uniform halting problem for term
rewriting systems. IRIA. Laboratoire de Recherche en Informatique et Automa-
tique, 1978.
[25] J W Hunt and M D McIlroy. An algorithm for differential file comparison.
1976.
[26] Richard M Karp. On-line algorithms versus off-line algorithms: How much is
it worth to know the future? In Proceedings of the IFIP 12th World Computer
Congress on Algorithms, Software, Architecture-Information Processing’92, Vol-
ume 1-Volume I, pages 416–429. North-Holland Publishing Co., 1992.
[27] Michael J Kearns and Umesh Virkumar Vazirani. An introduction to computa-
tional learning theory. MIT press, 1994.
[28] Donald E Knuth and Peter B Bendix. Simple word problems in universal alge-
bras. In Automation of Reasoning, pages 342–376. Springer, 1983.
[29] Temur Kutsia, Jordi Levy, and Mateu Villaret. Anti-unification for unranked
terms and hedges. Journal of Automated Reasoning, 52(2):155–190, 2014.
[30] Vishakha Mehra, Vinesh Jain, and Dolly Uppal. Dacomm: Detection and clas-
sification of metamorphic malware. In Communication Systems and Network
Technologies (CSNT), 2015 Fifth International Conference on, pages 668–673.
IEEE, 2015.
[31] Eugene W Myers. An O(ND) difference algorithm and its variations. Algorithmica,
1(1-4):251–266, 1986.
[32] Saul B Needleman and Christian D Wunsch. A general method applicable to
the search for similarities in the amino acid sequence of two proteins. Journal
of molecular biology, 48(3):443–453, 1970.
[33] Philip O'Kane, Sakir Sezer, and Kieran McLaughlin. Obfuscation: the hidden
malware. IEEE Security & Privacy, 9(5):41–47, 2011.
[34] Rodney Owens and Weichao Wang. Non-normalizable functions: A new method
to generate metamorphic malware. In 2011-MILCOM 2011 Military Communi-
cations Conference, pages 1279–1284. IEEE, 2011.
[35] Dana Ron. Automata Learning and its Applications. PhD thesis, Hebrew Uni-
versity, 1995.
[36] Eric D Simonaire. Sub-circuit selection and replacement algorithms modeled as
term rewriting systems. Technical report, DTIC Document, 2008.
[37] Peter Szor. The art of computer virus research and defense. Pearson Education,
2005.
[38] Yoshihito Toyama. Commutativity of term rewriting systems. Programming of
future generation computers II, pages 393–407, 1988.
[39] György Turán. Remarks on computational learning theory. Annals of Mathematics
and Artificial Intelligence, 28(1):43–45, 2000.
[40] Muhammad Afzal Upal. Learning plan rewriting rules. In Proceedings of the
Fourteenth International Florida Artificial Intelligence Research Society Confer-
ence, pages 412–416. AAAI Press, 2001.
[41] Leslie G Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134–1142, 1984.
[42] P Vinod, Vijay Laxmi, Manoj Singh Gaur, GVSS Kumar, and Yadvendra S
Chundawat. Static cfg analyzer for metamorphic malware code. In Proceedings
of the 2nd international conference on Security of information and networks,
pages 225–228. ACM, 2009.
[43] David Wagner and R Dean. Intrusion detection via static analysis. In Security
and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages
156–168. IEEE, 2001.
[44] Andrew Walenstein, Rachit Mathur, Mohamed R Chouchane, and Arun Lakho-
tia. Normalizing metamorphic malware using term rewriting. In Sixth IEEE
International Workshop on Source Code Analysis and Manipulation, pages 75–
84. IEEE, 2006.
[45] Andrew Walenstein, Rachit Mathur, Mohamed R Chouchane, and Arun Lakho-
tia. The design space of metamorphic malware. In 2nd International Conference
on i-Warfare and Security, pages 241–248, 2007.
[46] Wikipedia. Rewriting — wikipedia, the free encyclopedia. https://en.
wikipedia.org/w/index.php?title=Rewriting&oldid=698782291. Accessed:
24-August-2016.
[47] Qinghua Zhang and Douglas S Reeves. Metaaware: Identifying metamorphic
malware. In Computer Security Applications Conference, 2007. ACSAC 2007.
Twenty-Third Annual, pages 411–420. IEEE, 2007.
[48] Zhi-hong Zuo, Qing-xin Zhu, and Ming-tian Zhou. On the time complexity of
computer viruses. IEEE Transactions on information theory, 51(8):2962–2966,
2005.