The document discusses an improved method for storing feature vectors to detect Android malware. It proposes using a compressed row storage format to efficiently store the statistical features that represent malware families. This involves storing only the non-zero elements of sparse feature matrices in three vectors, which reduces storage needs by 79% compared to conventional methods. This improved storage technique leads to reduced processing time for feature vector generation and malware detection overall. The proposed method aims to enhance Android malware analysis by making feature vector searches and classification faster.
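The three-vector layout mentioned above can be sketched as follows. This is a minimal illustration, not the document's own implementation: the names `values`, `col_index` and `row_ptr` and the small example matrix are assumptions made for the sketch.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) into three CSR vectors."""
    values, col_index, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)      # non-zero element
                col_index.append(j)   # column of that element
        row_ptr.append(len(values))   # where the next row starts in `values`
    return values, col_index, row_ptr


def csr_row(values, col_index, row_ptr, i, n_cols):
    """Rebuild dense row i from the three CSR vectors."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_index[k]] = values[k]
    return row


# Hypothetical sparse feature matrix: one row per sample, one column per feature.
features = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 4],
]
values, col_index, row_ptr = to_csr(features)
```

Only the four non-zero entries are stored here, and any row can be recovered by reading the slice of `values` between two consecutive `row_ptr` entries.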
Finding Bad Code Smells with Neural Network Models IJECEIAES
A code smell is any symptom introduced in the design or implementation phases in the source code of a program. Such a code smell can potentially cause deeper and more serious problems during software maintenance. Existing approaches to detecting bad smells use detection rules or standards based on a combination of different object-oriented metrics. Although a variety of detection tools have been developed, they still have limitations and constraints in their capabilities. In this paper, a code smell detection system is presented with a neural network model that captures the relationship between bad smells and object-oriented metrics, taking a corpus of Java projects as the experimental dataset. The most well-known object-oriented metrics are considered to identify the presence of bad smells. The code smell detection system uses twenty Java projects that are shared by many users in GitHub repositories. The dataset of these Java projects is partitioned into mutually exclusive training and test sets. The training dataset is used to learn the network model that predicts smelly classes in this study. The optimized network model is then evaluated on the test dataset. The experimental results show that when the model is trained with more data, the prediction outcomes improve. In addition, the accuracy of the model increases when it is trained for more epochs and with more hidden layers.
Traffic analysis is of great importance when it comes to securing a network. This analysis can be classified into different levels, and one of the most interesting is Deep Packet Inspection (DPI). DPI is a very effective way of monitoring the network, since it performs traffic control over most of the OSI model’s layers (from L3 to L7). Regular Expressions (RegExp), on the other hand, are used in computer science to build a search pattern from a group of characters. This technique can be combined with a series of mathematical algorithms to help quickly find the search pattern within a text and even replace it with another value.
In this paper, we aim to prove that Regular Expressions are much more productive and effective when used for creating the matching rules needed in DPI. We design and test Regular Expression rules and compare them against the conventional methods. In addition, we have created a case study of detecting the EternalBlue and DoublePulsar threats, in order to point out the practical and realistic value of our proposal.
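As a rough illustration of a regular-expression matching rule applied to packet payloads, consider the sketch below. The rule itself is hypothetical and is not one of the paper's rules or a real EternalBlue or DoublePulsar signature; it simply flags HTTP requests whose path ends in `.exe` or `.dll`.

```python
import re

# Hypothetical DPI rule (an assumption for illustration only): an HTTP
# request line whose path ends in .exe or .dll, with an optional query string.
RULE = re.compile(
    rb"^(GET|POST) /\S*\.(exe|dll)(\?\S*)? HTTP/1\.[01]\r\n",
    re.IGNORECASE,
)


def inspect(payload: bytes) -> bool:
    """Return True when the payload matches the rule."""
    return RULE.search(payload) is not None


suspicious = inspect(b"GET /files/update.EXE HTTP/1.1\r\nHost: example\r\n\r\n")
with_query = inspect(b"POST /a.dll?x=1 HTTP/1.0\r\n\r\n")
benign = inspect(b"GET /index.html HTTP/1.1\r\nHost: example\r\n\r\n")
```

A single pattern of this kind covers both methods, both extensions, any letter case and an optional query string, which would otherwise require a list of separate fixed-string rules.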
International Journal of Engineering Research and Development (IJERD) IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing of journal, publishing of research paper, research and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathematics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer review journal, indexed journal, www.ijerd.com, research journals, yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
A NOVEL APPROACH TO ERROR DETECTION AND CORRECTION OF C PROGRAMS USING MACHIN... IJCI JOURNAL
Programmers have always struggled to identify errors while executing a program, be they syntactical or logical. This struggle has led to research into the identification of syntactical and logical errors. This paper surveys the research works that can be used to identify errors and proposes a new model based on machine learning and data mining that can detect logical and syntactical errors and either correct them or provide suggestions. The proposed work is based on the use of hashtags to identify each correct program uniquely; these can in turn be compared with the logically incorrect program in order to identify errors.
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC (MGA-FL) IJCSEA Journal
Data mining is defined as the process of extracting or mining knowledge from large databases. It is an interdisciplinary field that brings together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases. Bioinformatics is defined as the science of organizing and analyzing biological data. Microarray technology helps biologists monitor the expression of thousands of genes in a single experiment on a small chip. A microarray, also called a DNA chip, gene chip, or biochip, is used to analyze gene expression profiles. Fuzzy logic is defined as a multivalued logic that allows intermediate values to be defined between conventional evaluations like true or false, yes or no, high or low, etc. In this paper, a type 2 fuzzy logic approach is applied to microarray gene expression data to convert the numerical values into fuzzy terms. After fuzzification, the fuzzy association patterns are discovered. A framework is proposed to cluster microarray gene data based on fuzzy association patterns. The proposed type 2 fuzzy approach is then compared with traditional clustering algorithms.
Multibiometric Secure Index Value Code Generation for Authentication and Retr... ijsrd.com
The use of multiple biometric sources for human recognition, referred to as multibiometrics, mitigates some of the limitations of unimodal biometric systems by increasing recognition accuracy, improving population coverage, imparting fault-tolerance, and enhancing security. In a biometric identification system, the identity corresponding to the input data (probe) is typically determined by comparing it against the templates of all identities in a database (gallery). An alternative approach is to limit the number of identities against which matching is performed based on criteria that are fast to evaluate. We propose a method for generating fixed-length codes for indexing biometric databases. An index code is constructed by computing match scores between a biometric image and a fixed set of reference images. Candidate identities are retrieved based on the similarity between the index code of the probe image and those of the identities in the database. The number of multibiometric systems deployed on a national scale is increasing and the sizes of the underlying databases are growing. These databases are used extensively, thereby requiring efficient ways for searching and retrieving relevant identities. Searching a biometric database for an identity is usually done by comparing the probe image against every enrolled identity in the database and generating a ranked list of candidate identities. Depending on the nature of the matching algorithm, the matching speed in some systems can be slow. The proposed technique can be easily extended to retrieve pertinent identities from multimodal databases. Experiments on a chimeric face and fingerprint bimodal database resulted in an 84% average reduction in the search space at a hit rate of 100%. These results suggest that the proposed indexing scheme has the potential to substantially reduce the response time without compromising the accuracy of identification.
New representation schemes that allow for faster search and, therefore, shorter response time are needed.
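The index-code idea described above (match scores against a fixed set of reference images, then retrieval by code similarity) might be sketched as follows. Everything concrete here is an assumption for illustration: samples are plain feature vectors, `match_score` is a stand-in matcher based on Euclidean distance, cosine similarity compares codes, and the 0.9 threshold is arbitrary.

```python
import math


def match_score(a, b):
    """Hypothetical matcher: similarity in (0, 1] derived from Euclidean distance."""
    return 1.0 / (1.0 + math.dist(a, b))


def index_code(sample, references):
    """Fixed-length code: one match score per reference sample."""
    return [match_score(sample, r) for r in references]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def retrieve(probe_code, gallery_codes, threshold):
    """Keep only identities whose index code is similar enough to the probe's."""
    return [name for name, code in gallery_codes.items()
            if cosine(probe_code, code) >= threshold]


# Toy data: three reference samples and three enrolled identities.
references = [[0, 0], [10, 0], [0, 10]]
gallery = {"alice": [1, 1], "bob": [9, 1], "carol": [1, 9]}
gallery_codes = {name: index_code(s, references) for name, s in gallery.items()}

probe_code = index_code([1.2, 0.9], references)
candidates = retrieve(probe_code, gallery_codes, threshold=0.9)
```

Only the identities returned in `candidates` would then be passed to the slower full matcher, which is where the reduction in search space comes from.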
A Review on Grammar-Based Fuzzing Techniques CSCJournals
Fuzzing has become the most interesting software testing technique because it can find different types of bugs and vulnerabilities in many target programs. Grammar-based fuzzing tools have been shown to be effective in finding bugs and generating good fuzzing files. Fuzzing techniques are usually guided by different methods to improve their effectiveness. However, they have limitations as well. In this paper, we present an overview of grammar-based fuzzing tools and of the techniques used to guide them, which include mutation, machine learning, and evolutionary computing. A few studies have been conducted on this approach, and they show its effectiveness and quality in exploring new vulnerabilities in a program. Here we summarize the studied fuzzing tools and explain each one's method, input format, strengths and limitations. Some experiments are conducted on two of the fuzzing tools, which are compared based on the quality of the generated fuzzing files.
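To make the idea of grammar-based input generation concrete, here is a minimal sketch of a generator that expands a context-free grammar at random. The toy arithmetic grammar, the function name `generate` and the depth limit are assumptions for illustration; the tools surveyed in the paper are considerably more elaborate.

```python
import random

# Toy grammar (an assumption): each non-terminal maps to a list of expansions.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>"], ["(", "<expr>", ")"]],
    "<digit>": [[d] for d in "0123456789"],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a string by picking random grammar rules."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest rule so expansion terminates.
        options = [min(options, key=len)]
    expansion = rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


rng = random.Random(0)  # seeded so runs are repeatable
samples = [generate("<expr>", rng) for _ in range(5)]
```

Every generated string is syntactically valid for the grammar, which is what lets such inputs get past a target program's parser and exercise deeper code paths.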
The paper presents a k-means based semi-supervised clustering approach for recognizing and classifying P300 signals for a BCI Speller System. P300 signals have proved to be the most suitable Event Related Potential (ERP) signals for developing BCI systems. Due to the non-stationary nature of ERP signals, the wavelet transform is the best analysis tool for extracting informative features from P300 signals. The research focuses on semi-supervised clustering because a supervised clustering approach needs a large amount of labeled data for training, which is tedious to obtain. Hence this work uses small labeled datasets to train classifiers. Unsupervised clustering, on the other hand, works when no prior information is available, i.e. on totally unlabeled data, and thus leads to a low level of performance. The in-between solution is semi-supervised clustering, which uses a few labeled samples together with a large amount of unlabeled data and so takes less trouble and time. The authors have selected and defined ad hoc features and assumed the clusters for small datasets. This motivates us to propose a novel approach that discovers the features embedded in P300 (EEG) signals, using k-means based semi-supervised cluster classification with an ensemble SVM.
ANALYSIS OF MACHINE LEARNING ALGORITHMS WITH FEATURE SELECTION FOR INTRUSION ... IJNSA Journal
In recent times, various machine learning classifiers have been used to improve network intrusion detection, and researchers have proposed many solutions for intrusion detection in the literature. The machine learning classifiers are trained on older datasets for intrusion detection, which limits their detection accuracy. So, there is a need to train the machine learning classifiers on the latest dataset. In this paper, UNSW-NB15, the latest dataset, is used to train machine learning classifiers. The selected classifiers, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (NB), are taken from the taxonomy of classifiers based on lazy and eager learners. Chi-Square, a filter-based feature selection technique, is applied to the UNSW-NB15 dataset to remove irrelevant and redundant features. The performance of the classifiers is measured in terms of Accuracy, Mean Squared Error (MSE), Precision, Recall, F1-Score, True Positive Rate (TPR) and False Positive Rate (FPR), with and without the feature selection technique, and a comparative analysis of these machine learning classifiers is carried out.
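A filter-based Chi-Square selection step of the kind described above can be sketched in plain Python. This is an illustrative version only: it computes the Pearson chi-square statistic between each categorical feature and the class label and keeps the top-scoring features. The feature names and toy data are hypothetical, not actual UNSW-NB15 fields, and library implementations may compute the score differently.

```python
def chi_square(feature, labels):
    """Pearson chi-square statistic between a categorical feature and the labels."""
    n = len(labels)
    stat = 0.0
    for f in set(feature):
        for c in set(labels):
            observed = sum(1 for x, y in zip(feature, labels) if x == f and y == c)
            expected = feature.count(f) * labels.count(c) / n
            stat += (observed - expected) ** 2 / expected
    return stat


def select_top_k(columns, labels, k):
    """Rank features by chi-square score and keep the k highest."""
    scores = {name: chi_square(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data: 1 = attack, 0 = normal; feature names are made up.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "flag": [1, 1, 1, 1, 0, 0, 0, 0],     # perfectly aligned with the label
    "partial": [1, 1, 1, 0, 1, 0, 0, 0],  # partly informative
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],    # independent of the label
}
selected, scores = select_top_k(columns, labels, k=2)
```

A feature that is independent of the class scores zero and is dropped first, which is the sense in which the filter removes irrelevant features before any classifier is trained.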
Bioinformatics may be defined as the field of science in which biology, computer science, and information technology merge to form a single discipline. Its ultimate goal is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned by means of bioinformatics tools for storing, retrieving, organizing and analyzing biological data. Most of these tools possess very distinct features and capabilities, making a direct comparison difficult. In this paper we propose a taxonomy for characterizing bioinformatics tools and briefly survey the major bioinformatics tools under each category. Hopefully this study will help other designers and experienced end users understand the details of particular tool categories and tools, enabling them to make the best choices for their particular research interests.
Role of soluble urokinase plasminogen activator receptor (suPAR) as prognosis... IOSR Journals
The biological marker suPAR has been used in many pathological conditions, including infection, and has been correlated with the severity of sepsis. The purpose of this study was to determine suPAR levels in infants at risk of infection as a prognostic indicator for sepsis. Infants at risk of infection (n = 43) were followed prospectively on days 0, 3 and 7 and observed for the incidence of sepsis, compared with a control group (n = 10). suPAR was measured by ELISA and the course of infection was assessed by clinical criteria. suPAR results for days 0, 3 and 7 are displayed as boxplots, with AUC as the measure of prognostic power. The suPAR control level was 9.32 ng/mL and the sepsis cutoff 15.41 ng/mL, with an AUC of 80.3% [95% CI 65.7%, 94.9%, p = 0.00]. The ROC AUC for sepsis on days 0, 3 and 7 was 61.9%, 66.6% and 94.4%, respectively. Sepsis with improved outcome had a level of 16.53 ng/mL and worsening sepsis 22.19 ng/mL, with an AUC of 80.8% [95% CI (0.62 to 0.99), p = 0.02]. suPAR levels were increased in neonatal sepsis patients. suPAR could be used as a prognostic factor for neonatal sepsis.
IOSR Journal of Mathematics (IOSR-JM) is an open access international journal that provides rapid publication (within a month) of articles in all areas of mathematics and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in mathematics. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
IOSR Journal of Electronics and Communication Engineering(IOSR-JECE) is an open access international journal that provides rapid publication (within a month) of articles in all areas of electronics and communication engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in electronics and communication engineering. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Abstract: An audio mixer amplifier is a device that translates a signal of one frequency band to another. It accepts many inputs at different frequencies and generates an output that is the combination or sum of the frequencies. The mixer circuit provides good gain to weak audio signals. It can be used in front of an R.F. oscillator to make an R.F. receiver that is very sensitive to sound. Each input can be independently controlled by a variable resistor. There is also a provision for a balance control to fade out one signal while simultaneously fading in the other. Key Words: Audio Mixer, Frequency, Signal, Circuit
Android is a Linux-based operating system used for smartphone devices. Since 2008, Android devices have gained a huge market share due to their open architecture and popularity. The increased popularity of Android devices and the associated benefits attracted malware developers, and the rate of Android malware applications increased between 2008 and 2016. In this paper, we propose a dynamic malware detection approach for Android applications. In dynamic analysis, system calls are recorded to calculate the density of the system calls. For the density calculation, we used two different lengths of system call sequences, 3-gram and 5-gram. Furthermore, the Naive Bayes algorithm is applied to classify applications as benign or malicious. The proposed algorithm detects malware using 100 real-world samples of benign and malware applications. We observe that the proposed method gives effective and accurate results. The 3-gram Naive Bayes algorithm detects 84 malware applications correctly and 14 benign applications incorrectly. The 5-gram Naive Bayes algorithm detects 88 malware applications correctly and 10 benign applications incorrectly. Mr. Tushar Patil | Prof. Bharti Dhote, "Malware Detection in Android Applications", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd26449.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/26449/malware-detection-in-android-applications/mr-tushar-patil
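The abstract does not define "density" precisely, so the sketch below assumes it means the relative frequency of each system-call n-gram within a recorded trace. The trace and the function name `ngram_density` are made up for illustration.

```python
from collections import Counter


def ngram_density(calls, n):
    """Relative frequency of every length-n window of consecutive system calls."""
    grams = [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}  # trace shorter than n: no n-grams to count
    return {gram: count / total for gram, count in Counter(grams).items()}


# Hypothetical recorded trace of system calls for one application.
trace = ["open", "read", "write", "open", "read", "write", "close"]

density_3 = ngram_density(trace, 3)
density_5 = ngram_density(trace, 5)
```

Under this assumption, the density values of a trace would serve as the feature vector handed to a Naive Bayes classifier, with one model trained on 3-grams and another on 5-grams.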
MALWARE DETECTION USING MACHINE LEARNING ALGORITHMS AND REVERSE ENGINEERING O... IJNSA Journal
This research paper focuses on mobile application malware detection through reverse engineering of Android Java code and the use of machine learning algorithms. The malicious software characteristics were identified based on a collected set of 1958 applications (including 996 malware applications). During the research a unique set of features was chosen, then three attribute selection algorithms and five classification algorithms (Random Forest, K Nearest Neighbors, SVM, Naive Bayes and Logistic Regression) were examined to choose the algorithms that would provide the most effective rate of malware detection.
MINING PATTERNS OF SEQUENTIAL MALICIOUS APIS TO DETECT MALWARE IJNSA Journal
In the era of information technology and a connected world, detecting malware has been a major security concern for individuals, companies and even states. The new generation of malware samples, upgraded with advanced protection mechanisms such as packing and obfuscation, frustrates anti-virus solutions. API call analysis is used to identify suspicious malicious behavior thanks to its ability to describe a piece of software's functionality. In this paper, we propose an effective and efficient malware detection method that uses a sequential pattern mining algorithm to discover representative and discriminative API call patterns. Then, we apply three machine learning algorithms to classify malware samples. Based on the experimental results, the proposed method achieves favorable results, with a 0.999 F-measure on a dataset including 8152 malware samples belonging to 16 families and 523 benign samples.
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. IV (July – Aug. 2015), PP 52-60
www.iosrjournals.org
DOI: 10.9790/0661-17445260
An Improved Feature Vector storage metric for fast Android Malware Detection Framework
Mohit Sharma1, Meenu Chawla2, Vinesh Jain3, Jyoti Gajrani4
1, 2 (Dept. of CSE, Maulana Azad National Institute of Technology, Bhopal, India)
3, 4 (Dept. of CSE, Government Engineering College, Ajmer, India)
Abstract: Android-based devices are proliferating rapidly because of their ease of use and popularity. As a
result, the number of malware attacks on Android is also increasing. This paper builds on the Text Mining
approach for analyzing Android malware families. The proposed methodology is motivated by the DENDROID-based
method introduced by Guillermo Suarez-Tangil, which aims to automate the malware analysis process. The main
issue in this regard is the storage of the Family Feature Vectors (FFV), which are stored as a sparse matrix.
Therefore, this work presents a novel application of Compressed Row Storage (CRS) to store the statistical
features efficiently, keeping only the non-zero elements of the FFV of each malware family. The experimental
results show a large reduction (79%) in the space needed to store the FFV. This in turn reduces the Feature
Vector generation time and the Total process time. The proposed methodology reduces the dimensionality and
hence the time spent searching for a particular malware family signature.
Keywords: Android Malware, Text Mining, Statistical Features, Family Feature Vectors, Sparse Matrix,
Compressed Row Storage.
I. Introduction
Smartphone sales are growing rapidly across the globe. According to the International Data Corporation
(IDC) [1], the worldwide Smartphone market grew by 27.2% from 2013 to 2014. The same report states that
Smartphone shipments reached 334 million units by the first quarter of 2015, with the Android Operating System
(OS) taking the largest part (≈ 78%) of the share. The number of Android users has been increasing very
rapidly over the past few years because of the platform's open source nature [2]. The downside is a
continuous proliferation of Android malware.
Cybercriminals are continuously exploring vulnerabilities and have become more creative in camouflaging
their work. According to the Intelligence report published by Symantec Corporation in 2015 [3], an average
of 39 Android malware variants per family was discovered in May 2015. These range from malware that steals
information to adware. Research on thwarting such attacks is ongoing. The analysis and detection of Android
malware can be broadly classified into two categories, namely Static and Dynamic. Static analysis aims to
detect anomalous behavior by converting the executable code back into source code and then covering each
and every path in the code. There are many methodologies pertaining to Static Analysis. One such
methodology is Text Mining, which gathers meaningful information from the source code of the app under
consideration. Using Text Mining, the unique signatures and, in turn, the behavior of an Android Application
Package (APK) file can be stored. The existing methodology used this approach for storing the FFV. This
paper proposes an improvement to the storage of the Feature Vector. The rest of the paper is organized as
follows. Section II elaborates the background of the field. Section III covers related work regarding Static
Analysis. Section IV presents the proposed methodology. Section V discusses the implementation results.
Lastly, Section VI draws the conclusion and proposes future work.
II. Background
The background of this paper focuses on the existing methodology that was applied for malware
detection. The technique proposed in [4] is based on the Text Mining approach. In this context, Text Mining
refers to the retrieval of high-quality, "intelligent" information about the patterns appearing in the source
code of an APK file. The Android Malware Genome Project dataset, which consists of 49 malware families with
a total of 1,254 APK files, was used for this purpose. The paper applied reverse engineering on the dataset
to obtain the source code using the Google's Androguard open source tool [5]. Androguard is based on Context
Free Grammar (CFG) [6], through which the signature of a malware family is generated. The signatures are made
up of "Code Chunks" (CC), which are formed by applying the rules of the CFG. As Android applications are
written in the Java language, each CC represents a method present in a class. After this, a mathematical
analysis was performed on the signature files to evaluate the following terms:
2. An Improved Feature Vector storage metric for fast Android Malware Detection Framework
DOI: 10.9790/0661-17445260 www.iosrjournals.org 53 | Page
CC(a) – The set of all different CCs found in the app a.
Redundancy R(a) – The fraction of repeated CCs present in the app a.
Family Code Chunks FCC(Fi) – The total CCs of a particular family Fi.
Common Code Chunks CCC(Fi) – The set of common CCs within a particular family Fi.
Fully Discriminant Code Chunks FDCC(Fi) – The set of common CCs of family Fi that are not found in other families.
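The definitions above map naturally onto set operations. The following is a minimal sketch, not the implementation from [4]: the chunk labels and family names are hypothetical, FCC is read as the set of distinct chunks in a family, and the formula used for R(a) (one minus the ratio of distinct chunks to total chunks) is an assumption, since the text only describes it as the fraction of repeated CCs.

```python
def cc(app_chunks):
    """CC(a): the set of distinct code chunks found in one app."""
    return set(app_chunks)


def redundancy(app_chunks):
    """R(a): assumed here to be 1 - (distinct chunks / total chunks)."""
    if not app_chunks:
        return 0.0
    return 1.0 - len(set(app_chunks)) / len(app_chunks)


def fcc(family_apps):
    """FCC(Fi): every distinct code chunk seen in any app of the family."""
    return set().union(*(cc(a) for a in family_apps))


def ccc(family_apps):
    """CCC(Fi): code chunks shared by every app of the family."""
    sets = [cc(a) for a in family_apps]
    return set.intersection(*sets) if sets else set()


def fdcc(name, families):
    """FDCC(Fi): common chunks of family `name` that no other family contains."""
    others = set()
    for other, apps in families.items():
        if other != name:
            others |= fcc(apps)
    return ccc(families[name]) - others


# Toy data: hypothetical chunk labels for two hypothetical families.
families = {
    "F1": [["c1", "c2", "c2"], ["c1", "c3"]],
    "F2": [["c3", "c4"], ["c4", "c4", "c5"]],
}
print(ccc(families["F1"]), fdcc("F1", families), fdcc("F2", families))
```

Because FDCC is derived from CCC by a set difference, a single app that lacks one chunk removes that chunk from CCC and therefore from FDCC, which illustrates why the measure is considered fragile.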
The FDCC value of a family signature suggests that it can be used as a component to detect a
particular malware family. The authors argued, however, that because FDCC is fragile and dependent on
CCC, it cannot be used for detection. To solve this problem, Vector Space Modeling (VSM) is applied.
In the context of the methodology, VSM measures the relevance of a particular CC 'c' in an app 'a'.
The Code Chunk Frequency (CCF) measures the frequency of a code chunk 'c' in a family Fi. The Inverse
Family Frequency (IFF) of a CC 'c' measures how frequently the CC appears in family Fi and not in the
other families Fj, where i ≠ j. The product CCF * IFF is evaluated and stored as the FFV.
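The CCF * IFF weighting is analogous to the tf-idf weighting used in text retrieval. The sketch below is one possible reading under stated assumptions: the text does not give the exact formulas, so the normalisation of CCF by the largest count in the family and the logarithmic form of IFF are assumptions borrowed from tf-idf, and the function name `family_feature_vectors` is hypothetical.

```python
import math
from collections import Counter


def family_feature_vectors(family_chunks):
    """Map each family to {chunk: CCF * IFF}, keeping only non-zero weights.

    family_chunks: family name -> list of code chunks (with repeats).
    """
    n_families = len(family_chunks)
    counts = {f: Counter(chunks) for f, chunks in family_chunks.items()}

    # Number of families in which each chunk occurs at least once.
    family_freq = Counter()
    for c in counts.values():
        family_freq.update(c.keys())

    vectors = {}
    for f, c in counts.items():
        max_count = max(c.values(), default=1)
        vec = {}
        for chunk, n in c.items():
            ccf = n / max_count                              # assumed normalisation
            iff = math.log(n_families / family_freq[chunk])  # assumed idf-style form
            weight = ccf * iff
            if weight != 0.0:
                vec[chunk] = weight
        vectors[f] = vec
    return vectors


# Hypothetical chunk labels: "b" occurs in both families, so its weight is zero.
toy = {"F1": ["a", "a", "b"], "F2": ["b", "c"]}
vectors = family_feature_vectors(toy)
print(vectors)
```

Under this weighting a chunk shared by every family gets weight zero, which is one reason the resulting family vectors are sparse.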
The authors built a Java implementation of VSM and trained the system using 621 malware instances
from the Android Malware Genome Project (AMGP) dataset. A total of 84,854 unique CCs were found across
all malware families. The system was tested on another set of 610 malware instances, using the popular
1-Nearest Neighbor (1-NN) classifier to predict the family to which each instance belongs. The
experiment yielded 5.74% false positives. The authors also observed that DroidKungFu was the dominant
malware family and that many variants of it emerged over time, and they carried out an evolutionary
analysis of the malware families. As mentioned above, the feature vector of each family is stored with
a dimension of 84,854, which makes it very sparse. Therefore, this paper proposes a solution to this
problem through efficient storage of the FFV.
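A 1-NN prediction over sparse family vectors can be sketched as follows. This is an illustration only: the text does not state which similarity measure the classifier uses, so cosine similarity is an assumption, and the vectors, indexes and family names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {index: weight}."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_family(sample, family_vectors):
    """1-NN: return the family whose vector is most similar to the sample."""
    return max(family_vectors, key=lambda f: cosine(sample, family_vectors[f]))


# Hypothetical family vectors holding only their non-zero entries.
family_vectors = {"A": {1: 1.0, 2: 1.0}, "B": {3: 2.0}}
print(predict_family({1: 0.5}, family_vectors))
print(predict_family({3: 1.0, 1: 0.1}, family_vectors))
```

Storing only the non-zero entries means each comparison touches a handful of indexes rather than all 84,854 dimensions.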
III. Related Work
A number of automated tools are available for Android malware analysis, and researchers have
proposed several approaches in this regard. A comprehensive study in [7] reviewed about 100 papers on
feature selection and categorized the features that can be extracted from an APK file for malware
detection. These are static (permissions, Java code, intent filters, network addresses, etc.), dynamic
(system calls, network traffic, user interface, etc.), hybrid (a combination of static and dynamic
features), and application metadata (metadata, APK category, rating and description, etc.). The work
in [8] focuses on permission-based analysis and applies various feature selection methods and
classification algorithms. Its key idea is to consider the permissions requested by an APK, select the
features that best represent the dataset, and then apply a suitable classification algorithm. In [9],
the emphasis is on mining Android permission patterns by extracting the required permissions from
datasets of both benign and malicious applications. The technique also gains knowledge by analyzing
the "used" permissions of various apps. The difference between these permissions identifies the
anomalies. The authors used the Contrast Permission Pattern Mining (CPPM) algorithm for contrast
detection and malicious app detection. In [10], signatures are generated by extracting improbable byte
features that are resistant to obfuscation and repackaging. The method generates entropy features over
a byte block window and extracts the normalized most popular features. Further, it employs a
similarity digest hashing scheme on the byte stream based on robust statistical malicious features.
The signatures in [11] are generated by treating the source code as N-gram signatures. An N-gram model
is a probabilistic model that predicts the next item in a sequence from the preceding (N-1) items, as
in a Markov model of order (N-1). After this, the Common Vulnerability Scoring System (CVSS) was
employed to define the level of vulnerability of each generated feature. According to [12], the
behavior of Android applications is governed by two things: the manifest file and the class file.
Features were therefore extracted on the basis of both. The manifest-based features include the number
of activities, the number of services, the number of receivers, the list of permissions, and the
broadcasts listened to. The class-based features are extracted from opcode frequencies, 2-gram opcode
frequencies, and Application Programming Interface (API) calls. These were called sensitive
parameters, and a sensitive index was obtained from them. In [13], the function call graphs of the APK
file are extracted and an explicit mapping is employed to efficiently map the call graphs to an
explicit feature space. A Support Vector Machine (SVM) is then trained to distinguish benign from
malicious applications. In [14], flexible and robust family signatures that resist obfuscation and
repackaging are constructed by composing multiple entries consisting of four types of binary patterns,
viz. the class name, the method name, the character string, and the method (bytecode) body. The
signatures are stored in hashed format with an associated weight, thereby reducing the signature
comparison time. An unknown sample is compared against the hash patterns, and a similarity detection
metric is used to calculate the similarity between the sample's signature and the signatures of the
training dataset. In [15], malicious behavior patterns or models, called modalities (programming logic
segments corresponding to known suspicious behavior), are mined by drawing a two-tiered behavioral
graph (Component Dependency Graph and Component Behavior Graph). Within this behavior graph, the
method automatically identifies modalities and defines a modality vector. For an unknown target app,
its modality is constructed and compared with the modality vector for malware detection.
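As a small illustration of the N-gram features mentioned above, relative 2-gram frequencies over an opcode sequence can be computed as sketched below. The opcode names are hypothetical placeholders, and this is not the feature extractor of any of the cited works.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Relative frequency of each run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Hypothetical opcode sequence of one method.
opcodes = ["move", "invoke", "move", "invoke", "return"]
freqs = ngram_frequencies(opcodes, n=2)
print(freqs)
```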
IV. Methodology
To implement the concept introduced in Section II, the same AMGP dataset was used [16,17]. The
dataset consists of 33 malware families with 1,249 samples. Another dataset considered is from
contagio [18]. These datasets are referred to as Dataset - 1 and Dataset - 2, respectively, in the
remainder of this paper. The family-wise distribution of apps for both datasets is as follows:
Table 1: The AMGP dataset
Malware Family | Sample Count | Malware Family | Sample Count
ADRD | 21 | GingerMaster | 4
AnserverBot | 187 | GoldDream | 47
Asroot | 8 | Gone60 | 9
BaseBridge | 122 | GPSSMSSpy | 6
BeanBot | 8 | HippoSMS | 4
Bgserv | 9 | Jitfake | 1
CoinPirate | 1 | jSMSHider | 16
CruseWin | 2 | KMin | 52
DogWars | 1 | LoveTrap | 1
DroidCoupon | 1 | NickyBot | 1
DroidDeluxe | 1 | NickySpy | 2
DroidDream | 14 | Pjapps | 58
DroidDreamLight | 46 | Plankton | 11
DroidKungFu1 | 34 | RogueLemon | 2
DroidKungFu2 | 30 | RogueSPPush | 9
DroidKungFu3 | 309 | SMSReplicator | 1
DroidKungFu4 | 96 | SndApps | 10
DroidKungFuSapp | 3 | Spitmo | 1
DroidKungFuUpdate | 1 | Tapsnake | 2
EndOfDay | 1 | Walkinwat | 1
FakeNetFlix | 1 | YZHC | 22
FakePlayer | 5 | zHash | 11
GamblerSMS | 1 | Zitmo | 1
Geinimi | 67 | Zsone | 12
GGTracker | 1 | |
Table 2: The contagio dataset
Malware Family | Sample Count | Malware Family | Sample Count
Airpush | 1 | LoveTrap | 1
AnserverBot | 2 | NickySpy | 3
Bgserv | 3 | OBada | 2
CruiseWin | 2 | Pincer | 4
DogWars | 2 | PjApps | 45
DroidDeuxe | 1 | Plankton | 35
DroidDream | 8 | Ransom | 8
Fakebanker | 11 | RogueSPush | 10
FakeNetflix | 2 | SMSTrojan | 14
FakePlayer | 3 | SndApps | 10
Geinimi | 7 | Spitmo | 2
Gone60 | 6 | TapSnake | 3
HippoSMS | 2 | Titan | 2
jSMSHider | 17 | Zitmo | 9
KungFu | 311 | |
4.1 Introduction to CRS format
To store the M x N feature matrix efficiently, the Compressed Row Storage (CRS) format [19] for
sparse matrices is introduced here. The format makes no assumptions about the sparsity structure of
the matrix, stores only the necessary elements, and can be applied to any M x N sparse matrix. It
places the successive non-zero values of the matrix rows in contiguous memory locations. The M x N
matrix is stored and handled through three vectors:
Value (val): A 1 x C vector, where C is the number of non-zero elements, which contains all the non-zero elements of the matrix contiguously as they are traversed row by row.
Column Index (col_idx): A one-dimensional vector, also of length C, which contains the column indexes of the elements in the val vector.
Row Pointer (row_ptr): A one-dimensional vector of size (M+1), where M is the number of rows.
The first element of this vector is always initialized to zero. To understand the above format,
consider the following 5 x 6 sparse matrix A:
\[
A = \begin{bmatrix}
12 & 0 & 0 & 0 & 0 & -1 \\
5 & 0 & 0 & 3 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 23 \\
6 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]
The val vector stores all the non-zero elements of the matrix. The col_idx vector stores the
column indexes (counted from 1) of those elements. After its leading zero, row_ptr stores the number
of non-zero elements in each row. Thus, the CRS format of the above matrix is given by the three
vectors:
val = [12, -1, 5, 3, 1, 23, 6]
col_idx = [1, 6, 1, 4, 2, 6, 1]
row_ptr = [0, 2, 2, 2, 1, 0]
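To check that the three vectors carry the same information as the matrix, the sketch below rebuilds the dense rows from them. It assumes the conventions read off the example: column indexes are 1-based (with -1 sitting in column 6 of the first row, as printed in A), and row_ptr holds a leading zero followed by per-row counts. The function name `crs_to_dense` is hypothetical. Note that conventional CRS implementations store cumulative offsets in row_ptr instead of per-row counts; the last lines show the conversion.

```python
from itertools import accumulate


def crs_to_dense(val, col_idx, row_ptr, n_cols):
    """Rebuild the dense matrix from val, col_idx and per-row counts in row_ptr."""
    dense, pos = [], 0
    for count in row_ptr[1:]:          # skip the leading 0
        row = [0] * n_cols
        for k in range(pos, pos + count):
            row[col_idx[k] - 1] = val[k]   # col_idx is 1-based
        dense.append(row)
        pos += count
    return dense


val = [12, -1, 5, 3, 1, 23, 6]
col_idx = [1, 6, 1, 4, 2, 6, 1]
row_ptr = [0, 2, 2, 2, 1, 0]

A = crs_to_dense(val, col_idx, row_ptr, n_cols=6)
for row in A:
    print(row)

# Conventional CRS keeps running offsets rather than per-row counts.
offsets = list(accumulate(row_ptr))
print(offsets)
```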
4.2 Discussions
This way of storing sparse matrices does not always pay off. Let K be the size of each of the
three one-dimensional vectors described above, and let M and N be the numbers of rows and columns of
the sparse matrix. If the condition in Equation (1) holds,
\[ 3K + 1 \geq M \times N \qquad (1) \]
then the CRS representation needs at least as much storage as the original matrix and offers no
benefit. This happens when the matrix is not sufficiently sparse. The dataset analyzed contains 33
malware families, and these families possess different code structures. Thus, the signatures generated
by different families will be mostly mutually exclusive. Fig. 1 depicts this situation:
Figure 1: The mutually exclusive condition for the intersection of various families
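To understand this mathematically, consider Equation (2) for evaluating the intersection of all the
CCCs from all the families:
\[
\begin{aligned}
P(F_1 \cap F_2 \cap \dots \cap F_{33}) ={}& P(F_2 \cap F_3 \cap \dots \cap F_{33}) \cdot P(F_3 \cap F_4 \cap \dots \cap F_{33}) \cdots P(F_{32} \cap F_{33}) \cdot P(F_{33}) \\
& \cdot \big( P(F_1 \mid F_2 \cap \dots \cap F_{33}) \cdot P(F_2 \mid F_3 \cap \dots \cap F_{33}) \cdot P(F_3 \mid F_4 \cap \dots \cap F_{33}) \cdots P(F_{32} \mid F_{33}) \big)
\end{aligned}
\qquad (2)
\]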
Here, $F_1, F_2, \dots, F_{33}$ represent the CCCs present in the respective families. Consider
the CCCs of two malware families. Their intersection gives the CCCs that occur in both families,
represented as $I(CCC_{1,2})$. Next, the probability of occurrence of the CCCs of a third family with
respect to this intersected value, i.e. $I(CCC_{1,2 \mid 1})$, is required; it is represented as
$I(CCC_{1,2,3 \mid 1,2})$. Generalizing this gives Equation (3):
\[ I(CCC_{1,2,\dots,n \mid 1,2,\dots,n-1}) \qquad (3) \]
The above expression can alternatively be considered as the conditional probability of the CCCs
occurring among all the families. In the dataset analyzed, 84,854 unique FFV elements per family were
found, and there are 33 malware families in total. Thus, the matrix has approximately 28 lakh (2.8
million) elements. Of these, only about 1.6 lakh (160,000) were found to be non-zero, which amounts to
5.7%.
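The figures above can be plugged into Equation (1) as a quick sanity check. The sketch below uses the rounded count of 160,000 non-zero elements quoted in the text, so the density is approximate; the function and variable names are hypothetical.

```python
def crs_is_worthwhile(m, n, k):
    """Equation (1) says CRS stops paying off once 3K + 1 >= M * N."""
    return 3 * k + 1 < m * n


M, N = 33, 84_854        # families x unique code chunks
K = 160_000              # approx. "1.6 lakh" non-zero elements

total = M * N            # number of cells in the dense matrix
density = K / total      # fraction of non-zero cells
max_k = (total - 2) // 3 # largest K for which 3K + 1 < M * N still holds

print(total, round(100 * density, 1), 3 * K + 1, max_k)
print(crs_is_worthwhile(M, N, K))
```

With these numbers 3K + 1 is 480,001 against 2,800,182 dense cells, and the break-even point lies at roughly one third of the cells being non-zero, far above the observed 5.7%.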
Equation (2) was applied to the CCC values of the dataset to evaluate the proposed methodology.
The result of the evaluation was 4.67%. This value can be regarded as the share of important CCs
relative to all the malware families and, in turn, as the percentage of non-zero elements. This
estimate is close to the actual value of 5.7% calculated above. It can therefore be assumed that the
upper bound of this value is no greater than 10%. On the basis of the above analysis, it can be
concluded that the condition stated in Equation (1) is unlikely ever to be met. Hence, as per the
analysis of the dataset, the matrix will always have a sparsity of at least 90%. Another issue is the
application of the malware classifier algorithm to the reduced feature vector file. For this, training
and testing datasets with almost equal numbers of APK files can be prepared. The system can be trained
with the proposed methodology, with the feature vector file stored in CRS format. The feature vectors
of the testing dataset are generated in the same manner. Then, a matching algorithm can be employed
that matches the column indexes and non-zero values of each test sample with those of the training
dataset. A suitable match threshold can be set that classifies the test set instances into their
predicted malware families with a very low false positive rate. The algorithm for storing a sparse
matrix in CRS format is as follows:
Input: The M x N values of a (presumably sparse) matrix.
Output: Three 1-D vectors, namely val, col_idx, row_ptr
Procedure: 1
Initialize row_ptr with the single element 0
foreach row, i ∈ [1..M] do
    Initialize countRow to 0
    foreach column, j ∈ [1..N] do
        if the (i, j)th element is non-zero then
            Add the element to the val vector
            Store its column index j in col_idx
            Increment countRow
        end if
    end for
    Add countRow to row_ptr
end for
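The procedure above translates almost line for line into Python. This is a minimal sketch rather than the authors' code: the function name `dense_to_crs` is hypothetical, column indexes are 1-based as in the pseudocode, and row_ptr starts with a zero followed by the per-row counts. The input is the example matrix of Section 4.1, with rows 3 to 5 read off its printed elements.

```python
def dense_to_crs(matrix):
    """Encode a dense matrix as (val, col_idx, row_ptr) using per-row counts."""
    val, col_idx, row_ptr = [], [], [0]    # row_ptr starts with a zero
    for row in matrix:
        count_row = 0
        for j, element in enumerate(row, start=1):   # 1-based column index
            if element != 0:
                val.append(element)
                col_idx.append(j)
                count_row += 1
        row_ptr.append(count_row)
    return val, col_idx, row_ptr


A = [
    [12, 0, 0, 0, 0, -1],
    [5, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0, 23],
    [6, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
val, col_idx, row_ptr = dense_to_crs(A)
print(val)
print(col_idx)
print(row_ptr)
```

The encoder visits every cell once, so it runs in O(M x N) time, while the output holds two entries per non-zero element plus M + 1 row counts.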
The overall working of the proposed methodology is shown in Fig. 2:
Figure 2: Overall working of existing approach and highlighted proposed approach
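The matching step described in Section 4.2 (comparing the column indexes and non-zero values of a test sample against the stored training vectors, subject to a threshold) could look like the sketch below. Everything here is an assumption made for illustration: the score (fraction of the sample's non-zero entries found at the same column with nearly the same value), the tolerance, the threshold of 0.5, and the family names are all hypothetical.

```python
def crs_row(val, col_idx, row_ptr, r):
    """Return row r (1-based) as {column: value}; row_ptr holds per-row counts."""
    start = sum(row_ptr[:r])
    end = start + row_ptr[r]
    return dict(zip(col_idx[start:end], val[start:end]))


def match_score(sample, family_row, tol=1e-6):
    """Fraction of the sample's entries matched by column index and value."""
    if not sample:
        return 0.0
    hits = sum(
        1 for col, v in sample.items()
        if col in family_row and abs(family_row[col] - v) <= tol
    )
    return hits / len(sample)


def classify(sample, val, col_idx, row_ptr, families, threshold=0.5):
    """Return the best-matching family, or None if no score reaches the threshold."""
    best, best_score = None, 0.0
    for r, name in enumerate(families, start=1):
        score = match_score(sample, crs_row(val, col_idx, row_ptr, r))
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= threshold else None


# Training vectors in the three-vector form, one row per hypothetical family.
val = [12, -1, 5, 3, 1, 23, 6]
col_idx = [1, 6, 1, 4, 2, 6, 1]
row_ptr = [0, 2, 2, 2, 1, 0]
families = ["F1", "F2", "F3", "F4", "F5"]

print(classify({2: 1, 6: 23}, val, col_idx, row_ptr, families))
print(classify({3: 7}, val, col_idx, row_ptr, families))
```

Only the non-zero entries of each family row are touched during matching, which is where the compact representation is expected to save time.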
V. Evaluation
The original and the modified methods used the datasets listed in Tables 1 and 2 and were
executed on a benchmark machine (Intel Core i5, 4 GB RAM). The experimental results are as follows.
On Dataset - 1, the size of the feature vector file shrank from 11,983 KB (11.70 MB) with the previous
methodology to 2,492 KB (2.43 MB) with the proposed one. Fig. 3 compares the disk space taken by the
FFV under both methodologies on Dataset - 1.
In addition to this improvement, a speedup was also observed in the FFV generation phase. For
Dataset - 1, a total of 15 runs were performed on the benchmark machine to measure the difference in
feature generation time and total process time between the two methodologies. Among these runs, the
feature generation time of the existing methodology ranged from a minimum of 0.8 seconds to a maximum
of 1.7 seconds, while that of the proposed methodology ranged from 0.42 seconds to 1.7 seconds. The
two boundary outlier values were discarded while calculating the results. The five data values with a
variation of 13% are plotted; for these, the existing methodology took 1.17 to 1.42 seconds while the
proposed methodology took 0.78 to 1.09 seconds. The results are shown in Fig. 4. For the whole
process, the five values with the existing methodology lay between 49.92 and 51.87 seconds, while
those of the proposed methodology lay between 48.73 and 49.81 seconds on Dataset - 1, as shown in
Fig. 5.
To confirm the efficiency of the proposed methodology, it was applied to a new training dataset
collected from the Contagio Mobile malware mini dump [18], referred to as Dataset - 2. All the
password-protected APKs were extracted using a predefined password and then arranged by malware
family. Dataset - 2 consists of 29 malware families with 526 apps. The same methodology was applied to
this dataset, and the results are as follows. The size of the feature file shrank from 6,850 KB
(6.69 MB) with the previous methodology to 1,382 KB (1.35 MB), as shown in Fig. 6. The same number of
runs, i.e. 15, was performed with this dataset. For the five plotted data values, the feature vector
generation phase took 0.69 to 0.75 seconds with the existing methodology and 0.36 to 0.55 seconds with
the proposed methodology, as shown in Fig. 7. The total time for the whole process was around 51.56 to
52.99 seconds with the existing methodology and 49.02 to 50.69 seconds with the proposed methodology,
as shown in Fig. 8:
Table 3: Summarized statistics of improved results of proposed methodology over existing methodology
Dataset considered | Reduction in disk space | Reduction in FFV generation time | Reduction in total process time
Dataset - 1 | 79.20% | 31.54% | 2.66%
Dataset - 2 | 79.82% | 37.50% | 4.38%
Table 3 summarizes the improvement of the proposed methodology over the existing methodology on both
datasets. The results show that the proposed methodology reduces disk space and yields a speedup.
Removing the sparsity from the feature vector matrix considerably reduces the size of the FFV file,
and the family feature generation time is also reduced. Although the total process time does not show
much reduction, some speedup has still been achieved. The total process time needs further
improvement.
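The disk space column of Table 3 follows directly from the file sizes reported in this section, as the short check below shows. The timing columns cannot be re-derived in the same way, because only ranges, not the individual run times, are given. The helper name `reduction_pct` is hypothetical.

```python
def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


# FFV file sizes in KB: (existing methodology, proposed methodology).
sizes_kb = {
    "Dataset - 1": (11983, 2492),
    "Dataset - 2": (6850, 1382),
}

reductions = {name: round(reduction_pct(b, a), 2) for name, (b, a) in sizes_kb.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}%")
```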
VI. Conclusions And Future Scope
In this paper, a more efficient solution for storing family feature vectors than the previous
methodology is presented. The two-dimensional sparse matrix is now represented by three 1-D vectors
incorporating only non-zero elements. The proposed technique reduces both the space in which the
feature vectors are stored and the time in which they are generated. As the results above show,
removing the sparsity from the feature vector matrix considerably reduces the size of the FFV file and
shortens the family feature generation time. Although the total process time does not show much
reduction, some speedup has still been achieved, and the whole process time needs further improvement.
The future work comprises the following aspects:
The proposed methodology can be extended to the classification of an unknown sample. The FFV of the test dataset can be generated in the same manner, and the predicted family can then be decided on the basis of the non-zero values and column indexes.
Measures to defeat obfuscation are not implemented. Some attackers use obfuscation techniques such as junk code insertion and string encryption so that the signature of the malware changes and it goes undetected by anti-malware tools.
As new malware is detected, the signatures and feature vectors change. Thus, the database should be kept in synchronization with the new malware signatures. This can be achieved by communicating regularly with various malware repositories.
Support Vector Machines (SVM) can be used for detecting unknown samples instead of the 1-NN classifier used in the existing methodology. However, an SVM classifier trained on an imbalanced dataset can produce suboptimal models that are biased towards the majority class and perform poorly on the minority class [20].
Lastly, the methodology is not available online. A web version of this technique can be developed so that anyone can upload a sample APK for analysis and get the result quickly.
References
[1] “IDC”, “Smartphone OS Market Share, Q1 2015”, http://www.idc.com/prodserv/smartphone-os-market-share.jsp
[2] Khalid Alfalqi, Rubayyi Alghamdi and Mofareh Waqdan: Android Platform Malware Analysis, (IJACSA) International Journal of
Advanced Computer Science and Applications, Vol. 6, No. 1, 2015.
[3] Android Introduction: Platform Overview, Mihail L. Sichitiu, 2011 SECURITY RESPONSE: Mobile Adware and Malware
Analysis: Symantec 2013.
[4] Guillermo Suarez-Tangil, Juan E. Tapiador, Pedro Peris-Lopez and Jorge Blasco Alis: Dendroid: A Text Mining Approach to
Analyzing and Classifying Code Structures in Android Malware Families. Expert Systems with Applications, Elsevier, July 2013.
[5] androguard, Reverse engineering, Malware and goodware analysis of Android applications ... and more (ninja!),
https://code.google.com/p/androguard/.
[6] Cesare, S. and Xiang, Y.: Classification of malware using structured control flow. Proceedings of the eighth Australasian
symposium on parallel and distributed computing (Vol. 107, pp. 61–70). Australian Computer Society, Inc, 2010.
[7] Ali Feizollah et al.: A review on feature selection in mobile malware detection, Volume 13, Elsevier, June 2015.
[8] Veelasha Moonsamy, Jia Rong and Shaowu Liu: Mining permission patterns for contrasting clean and malicious android
applications, Future Generation for Computer Systems, Elsevier, 2013.
[9] Michael Spreitzenbarth, Florian Echtler and Johannes Hoffmann: Mobile-Sandbox: Having a Deeper Look into Android
Applications, In SAC'13, Coimbra, Portugal, ACM, 2013.
[10] Iker Burguera, Urko Zurutuza and Simin Nadjm-Tehrani: Crowdroid: Behavior-Based Malware Detection System for Android, In
SPSM'11, Chicago, Illinois, USA, ACM, 2011.
[11] Parvez Faruki, Vijay Ganmoor, Vijay Laxmi, M. S. Gaur and Ammar Bharmal: AndroSimilar: Robust Statistical Feature Signature
for Android Malware Detection, In SIN'13, Aksaray, Turkey, ACM, 2013.
[12] R.Dhaya and M. Poongodi: Detecting Software vulnerabilities using Static Analysis, IEEE ICACCCT, 2014.
[13] Samaneh Hosseini Moghaddam, Maghsood Abbaspour: Sensitivity Analysis of Static Features for Android Malware Detection, In
22nd Iranian Conference on Electrical Engineering (ICEE 2014), May 20-22, 2014
[14] Hugo Gascon, Fabian Yamaguchi, Daniel Arp and Konrad Rieck: Structural Detection of Android Malware using Embedded Call
Graphs, In AISec'13, ACM, 2013
[15] Yajin Zhou and Xuxian Jiang: Dissecting Android Malware: Characterization and Evolution, 2012 IEEE Symposium on Security
and Privacy.
[16] X. Jiang and Y. Zhou: Chapter 2 A Survey of Android Malware: Android Malware, Springer Briefs in Computer Science.
[17] "contagio mobile", "mobile malware mini dump", http://contagiominidump.blogspot.in/
[18] Shahadat Hossain: Data Structure for efficient storage of Sparse Matrices.
[19] Rukshan Batuwita and Vasile Palade: CLASS IMBALANCE LEARNING METHODS FOR SUPPORT VECTOR MACHINES,
Copyright 2012 John Wiley & Sons, Inc.
[20] Jehyun Lee, Suyeon Lee, Heejo Lee: Screening Smartphone applications using malware family signatures, Elsevier, 2015
[21] Ugur PEHLIVAN, Nuray BALTACI, Cengiz ACARTÜRK, Nazife BAYKAL: The Analysis of Feature Selection Methods and
Classification Algorithms in Permission Based Android Malware Detection, In Computational Intelligence in Cyber Security
(CICS), 2014 IEEE Symposium, Dec 2014
[22] Hieu Le Thanh: Analysis of Malware Families on Android Mobiles: Detection Characteristics Recognizable by Ordinary Phone
Users and How to Fix It: Journal of Information Security, 2013, 4, 213-224, October 2013.
[23] Zarni Aung and Win Zaw: Permission based Malware analysis, In INTERNATIONAL JOURNAL OF SCIENTIFIC &
TECHNOLOGY RESEARCH VOLUME 2, ISSUE 3, MARCH 2013.
[24] Egele, M., Scholte, T., Kirda, E., and Kruegel, C.: A survey on automated dynamic malware-analysis techniques and tools. ACM
Comput. Surv. 44, 2, Article 6th February 2012.
[25] Parvez Faruki et al.: Android Security: A Survey of Issues, Malware Penetration and Defenses, IEEE COMMUNICATIONS
SURVEYS AND TUTORIALS, VOL. 00, NO. 0, JANUARY 2015.
[26] Timothy Vidas and Nicolas Christin: Evading Android Runtime Analysis via Sandbox Detection, ASIA CCS'14, June 4–6, 2014,
Kyoto, Japan.
[27] Minakshi Ramteke, Prof. Praveen Sen and Suchit Sapate: Comparative Study and a Survey on Malware Analysis Approaches for
Android Devices, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, 3rd
March 2014.
[28] Chao Yang et al.: DroidMiner: Automated Mining and Characterization of Fine-grained Malicious Behaviors in Android
Applications, 19th European Symposium on Research in Computer Security, Wroclaw, Poland, September 7-11, 2014. Proceedings,
Part I, Copyright Springer
[29] Guillermo Suarez-Tangil et al.: Thwarting Obfuscated Malware via Differential Fault Analysis, Volume: 47, Issue: 6, 2014 IEEE.
[30] Muazzam Siddiqui, Morgan C. Wang and Joohan Lee: A Survey of Data Mining Techniques for Malware Detection using File
Features, In ACM-SE '08, March 28-29, 2008.