Malware analysis

www.robertodaguarcino.com
Master of Engineering in Computer Science
Machine Learning for Malware Analysis
Bayesian approach for malwares classification by family
Roberto Falconi

Summary
1. Abstract...............................................................................................................................3
2. Why I used Python..............................................................................................................3
3. Program setup.....................................................................................................................4
3.1 Windows setup............................................................................................................4
3.2 Linux or macOS setup..................................................................................................4
4. Dataset preparation............................................................................................................5
4.1 Drebin dataset .............................................................................................................5
4.2 Data pre-processing.....................................................................................................6
4.3 From nominal data to numeric data ...........................................................................7
4.4 Integer Encoding..........................................................................................................8
4.5 One-Hot Encoder.........................................................................................................9
5. Overfitting and underfitting avoidance ............................................................................10
5.1 Bias and Variance ......................................................................................................10
5.2 Training set and test set ............................................................................................10
5.3 Cross-validation .........................................................................................................11
6. The classification problem ................................................................................................12
6.1 Random Forest ..........................................................................................................12
6.2 SVM............................................................................................................................12
6.3 Classify malwares by family.......................................................................................12
6.4 Binary classifiers ........................................................................................................13
6.5 From binary to multiclass..........................................................................................14
6.6 One vs All and One vs One.........................................................................................14
6.7 Accuracy score...........................................................................................................15
6.8 Confusion Matrix .......................................................................................................16
6.9 Precision score...........................................................................................................16
6.10 Recall score ............................................................................................................17
6.11 F1 score..................................................................................................................18
7. Final results.......................................................................................................................18

1. Abstract
Goal of this project is to understand the data contained in the DREBIN dataset to define a
classification problem for malware analysis (target function).
I decided to make a multiclass classification: I classified all malwares by family.
First, I classified malwares by family using binary classifiers for each family.
Second, I calculated the probability that a malware really belongs to some family and finally,
with One vs All and other methodologies, I came back to the multiclass problem to classify
the malware by family.
The evaluation procedure and the results are described in the following report and it
includes (but it is not limited to) dataset preprocessing, integer encoding, one-hot encoder,
overfitting avoidance using dataset splitting into training set and test set and cross
validation, the usage of classifiers such as Random Forest and Support Vector Machine (also
known as SVM or Support Vector Classification and SVC in Scikit-learn), and the coming back
from binary to multiclass problem using One vs All and finally the calculation of confusion
matrix, accuracy, misclassification rate, precision, recall, f1 and others (each described and
argued).
2. Why I used Python
First, I want to explain why I used Python to develop this project.
Python is popular in machine learning because of many inter-related reasons: it is simple,
elegant, consistent, and math-like.
Python code has been described as readable pseudocode. It is easy to pick up due to its
consistent syntax and the way it mirrors human language and/or their mathematical
counterparts.

It is math-like in that some "objects" that are very much part of a mathematician's
vocabulary are part of the language without you having to install / import them, and they
resemble their equivalent mathematical counterparts. With carefully chosen
variable/function names, the code can be read like math or English, because you simply
don’t need to declare the type of a variable or to manually cast them.
The latter (much due to libraries such as Pandas, NumPy and Scikit-learn) is something one
will appreciate if he were to implement a machine learning algorithm of which the core is
likely just mathematical optimization.
3. Program setup
3.1 Windows setup
1. Download Visual Studio Code
https://code.visualstudio.com/Download
2. Install Python plugin for VS Code
https://marketplace.visualstudio.com/items?itemName=ms-python.python
3. Download and install Python 3.7.1 64 bit for Window
https://www.python.org/downloads/release/python-371
(It is important to mark "Add Python 3.7 to PATH" and "Disable max path length" at
the end of the setup)
4. Open VS Code terminal (ctrl + ò)
5. In the terminal, type the "pip install pandas" command
6. Again, type "pip install sklearn" command
7. Finally, "py <homeworkpath.py>" to run the code and read the printed results
described and argued in this report in the following chapters.
3.2 Linux or macOS setup
As the same way of the Windows setup, it is important to download Python at 64 bit (adding
it to the path and disabling the max path length), then, you can run it in an IDE like Visual

Studio or just in the terminal using the same commands, but it is important to replace the
“pip” command with “sudo pip3” and “py” with “sudo python3”.
4. Dataset preparation
4.1 Drebin dataset
As the limited resources impede monitoring applications at run-time, DREBIN performs a
broad static analysis, gathering as many features of an application as possible. These features
are embedded in a joint vector space, such that typical patterns indicative for malware can be
automatically identified and used for explaining the decisions of our method.
In an evaluation with 123,453 applications and 5,560 malware samples DREBIN outperforms
several related approaches and detects 94% of the malware with few false alarms, where the
explanations provided for each detection reveal relevant properties of the detected malware.
DREBIN performs a broad static analysis, gathering as many features from an application’s
code and manifest as possible. These features are organized in sets of strings (such as
permissions, API calls and network addresses) and embedded in a joint vector space.
As an example, an application sending premium SMS messages is cast to a specific region in
the vector space associated with the corresponding permissions, intents and API calls.

To foster research in the area of malware detection and to enable a comparison of different
approaches, we make the malicious Android applications used in our work as well as all
extracted feature sets available to other researchers in the DREBIN dataset.
4.2 Data pre-processing
It is possible to analyze malware and to classify them in families with a Machine Learning
program. In this case I chosen to use Python as explained above.
Data preprocessing involves data preparation and dividing the data into training and testing
sets.
Drebin sha256_family.csv has been loaded as the dataset and so it has been used to data pre-
processing and data analysis through some methodologies and techniques. The dataset has
two elements: sha256 string (file name) and its own family, but I take only families with more
than 20 elements, while all the other families (and their own elements) have been deleted.
This, because families with less then 20 elements can be statistically irrelevant, or worse.

Features of the files were loaded into Design Matrix 𝑿 ∈ 𝑅 𝑁×𝐷
and the output in 𝒚 ∈ 𝑅 𝑁
,
𝑦𝑖 ∈ {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, 𝑒𝑡𝑐. }:
𝑿 = [
𝑥11 ⋯ 𝑥1𝑛
⋮ ⋱ ⋮
𝑥 𝑛1 ⋯ 𝑥 𝑛𝑛
]
𝒚 = (
𝑦1
⋮
𝑦𝑛
)
4.3 From nominal data to numeric data
Nominal data (or categorical data) are variables that contain label values rather than numeric
values. The number of possible values is often limited to a fixed set. Categorical variables are
often called nominal.
Some examples include:
A “pet” variable with the values: “dog” and “cat“.
A “color” variable with the values: “red“, “green” and “blue“.
A “place” variable with the values: “first”, “second” and “third“.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as a natural ordering.
The “place” variable above does have a natural ordering of values. This type of categorical
variable is called an ordinal variable.
What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data
transform required (this depends on the specific implementation).
Many machine learning algorithms cannot operate on label data directly. They require all
input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning
algorithms rather than hard limitations on the algorithms themselves.
This means that categorical data must be converted to a numerical form. If the categorical
variable is an output variable, you may also want to convert predictions by the model back
into a categorical form in order to present them or use them in some application.
How to Convert Categorical Data to Numerical Data?
This involves two steps: integer encoding and One-Hot Encoder.
4.4 Integer Encoding
As a first step, each unique category value is assigned an integer value.
For example, “red” is 1, “green” is 2, and “blue” is 3.
This is called a label encoding or an integer encoding and is easily reversible.
For some variables, this may be enough. The integer values have a natural ordered
relationship between each other and machine learning algorithms may be able to understand
and harness this relationship.
For example, ordinal variables like places would be a good example where a label/integer
encoding would be enough.

4.5 One-Hot Encoder
For categorical variables where no such ordinal relationship exists, the integer encoding is not
enough.
In fact, using this encoding and allowing the model to assume a natural ordering between
categories may result in poor performance or unexpected results (predictions halfway
between categories).
In this case, a one-hot encoder can be applied to the integer representation. This is where the
integer encoded variable is removed and a new binary variable is added for each unique
integer value.
In the “color” variable example, there are 3 categories and therefore 3 binary variables are
needed. A “1” value is placed in the binary variable for the color and “0” values for the other
colors.
For example, for three different color elements:
red green blue
1 0 0
0 1 0
0 0 1
The binary variables are often called “dummy variables” in other fields, such as statistics.
In our case, I have applied this last technique on all the features, including the families and
the other inside the features’ vectors.

5. Overfitting and underfitting avoidance
5.1 Bias and Variance
When evaluating a machine learning model, it is important to balance Bias and Variance.
High Bias refers to a scenario where the model is “underfitting” the dataset. This is bad
because the model is not presenting a very accurate or representative picture of the
relationship between inputs and predicted output and is often outputting high error.
High Variance represents the opposite scenario. In cases of High Variance or “overfitting”,
machine learning model is so accurate that it is perfectly fitted to your example dataset. While
this may seem like a good outcome, it is also a cause for concern, as such models often fail to
generalize to future datasets. So, while the model works well for existing data, it is not known
how well it will perform on other examples.
5.2 Training set and test set
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples that it has
just seen would have a perfect score but would fail to predict anything useful on yet-unseen
data. This situation is called overfitting. To avoid it, it is common practice when performing a
(supervised) machine learning experiment to hold out part of the available data as a test set
X_test, y_test. Note that the word “experiment” is not intended to denote academic use only,
because even in commercial settings machine learning usually starts out experimentally.
In scikit-learn a random split into training and test sets can be quickly computed with the
train_test_split helper function.

5.3 Cross-validation
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting
that must be manually set for an SVM, there is still a risk of overfitting on the test set because
the parameters can be tweaked until the estimator performs optimally. This way, knowledge
about the test set can “leak” into the model and evaluation metrics no longer report on
generalization performance. To solve this problem, yet another part of the dataset can be held
out as a so-called “validation set”: training proceeds on the training set, after which evaluation
is done on the validation set, and when the experiment seems to be successful, final
evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number
of samples which can be used for learning the model, and the results can depend on a
particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set
should still be held out for final evaluation, but the validation set is no longer needed when
doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets
(other approaches are described below, but generally follow the same principles). The
following procedure is followed for each of the k “folds”:
A model is trained using of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set
to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values
computed in the loop. This approach can be computationally expensive but does not waste
too much data (as is the case when fixing an arbitrary validation set), which is a major
advantage in problems such as inverse inference where the number of samples is very small.
The simplest way to use cross-validation is to call the cross_val_score helper function on the
estimator and the dataset, but I have used also the cross_val_predict method in order to come
back from a binary to a multiclass problem.

6. The classification problem
6.1 Random Forest
Random Forest is intrinsically suited for multiclass problems. It works well with a mixture of
numerical and categorical features. When features are on the various scales, it is also fine.
Roughly speaking, with Random Forest you can use data as they are. As a consequence, one-
hot encoder for categorical features is a must-do. Further, min-max or other scaling is highly
recommended at preprocessing step. At last, for a classification problem Random Forest
gives you probability of belonging to a class.
6.2 SVM
SVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
LinearSVC is an implementation of Support Vector Classification for the case of a linear kernel.
SVC implements the One vs One approach for multiclass classification. On the other hand,
LinearSVC implements One vs Rest multiclass strategy, thus training n class models.
Both One vs One and One vs Rest strategies are discussed in the next chapters.
One-hot encoder discretization, in a first time, lead us to multiple binary classifiers instead of
a single multiclass classifier.
𝒚 = 𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯
𝑤ℎ𝑒𝑟𝑒:
𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 ∈ {0,1}
𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 ∈ {0,1}
⋮
6.3 Classify malwares by family
Running binary classifier for each family return partial functions:

𝑓𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝑓𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, …
With Scikit-learn predict method we got:
𝒚̃𝑖, 𝑤ℎ𝑒𝑟𝑒 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̃𝑖 = 𝑦̃1, … , 𝑦̃ 𝑛,
𝑦̃1, … , 𝑦̃ 𝑛 ∈ {0,1}
6.4 Binary classifiers
Scikit-learn provides methods in the sklearn.metrics module such as accuracy_score,
precision_score, recall_score and f1_score (discussed later). These methods print
misclassifications and accuracy, precision, recall and f1 scores of every binary classifier of
each family.
In our case, we have good results for each binary classifier (discussed and argued in the last
chapter), but it is not enough to understand if the results are so good as they seem to be,
because at this moment we are considering each single classifier and not them all, nor the
probability that an element belongs to the predicted family or to the others.
When performing classification, you often want to predict not only the class, but also the
associated probability. This probability gives you some kind of confidence on the prediction.
However, not all classifiers provide well-calibrated probabilities, some being over-confident
while others being under-confident. Thus, a separate calibration of predicted probabilities is
often desirable as a postprocessing.
In the next chapters, I will go deep the various scores analyzing them re-calculated using
predict_proba method instead of the predict method, normalizing the probability and
applying One vs All methodology.

6.5 From binary to multiclass
Some metrics are essentially defined for binary classification tasks. In these cases, by default
only the positive label is evaluated, assuming by default that the positive class is labelled 1
(though this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass problems, the data is treated as a collection of binary
problems, one for each class. There are then several ways to average binary metric
calculations across the set of classes, each of which may be useful in some scenario. Where
available, Scikit-learn suggests us to select among these using the average parameter: "macro"
simply calculates the mean of the binary metrics, giving equal weight to each class. In
problems where infrequent classes are nonetheless important, macro-averaging may be a
means of highlighting their performance. On the other hand, the assumption that all classes
are equally important is often untrue, such that macro-averaging will over-emphasize the
typically low performance on an infrequent class; "weighted" accounts for class imbalance by
computing the average of binary metrics in which each class’s score is weighted by its
presence in the true data sample.
6.6 One vs All and One vs One
One vs All (also known as One vs Rest) strategy involves training a single classifier per class,
with the samples of that class as positive samples and all other samples as negatives. This
strategy requires the base classifiers to produce a real-valued confidence score for its
decision, rather than just a class label; discrete class labels alone can lead to ambiguities,
where multiple classes are predicted for a single sample. In the One vs One reduction, one
trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the
samples of a pair of classes from the original training set and must learn to distinguish these
two classes.
One vs All and One vs One methodologies bring us to the original problem of multiclass
classification.

Making decisions means applying all classifiers to an unseen sample x and predicting the class
k for which the corresponding classifier reports the highest confidence score:
𝑦̂ = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑓𝑘(𝑥), 𝑘 ∈ {1, … , 𝐾}
Thanks to predict_proba method of Scikit-learn we have:
𝒚̂𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̂𝑖 = 𝑦̂1, … , 𝑦̂ 𝑛,
𝑦̂1, … , 𝑦̂ 𝑛 ∈ [0,1]
that represent this confidence score in a probability status.
To normalize the partial results (∑ 𝒚̅𝑖 𝑖
= 1):
𝒚̅𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̅𝑖 =
𝒚̂𝑖
𝒚̂ 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒚̂ 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯
At last, we can apply the majority rule with a threshold of 0.50 confidence score:
𝐼𝑓 𝒚̅𝑖 > 0.50 ⇒ 𝒚̅𝑖 ∈ 𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … }
else, it goes to a misclassification.
Once proceeds the confidence score (or misclassifications) for every family, we can calculate
scores to understand if the classification can be good enough. In the next paragraphs there
will be discussed each used score and in the final chapter it will be analyzed my code’s results.
6.7 Accuracy score
The accuracy_score function computes the accuracy, either the fraction (default) or the count
(normalize=False) of correct predictions. If 𝑦̂𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖
is the corresponding true value, then the fraction of correct predictions over 𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 is
defined as

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑦, 𝑦̂) =
1
𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠
∑ 1(𝑦̂𝑖 = 𝑦𝑖)
𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠−1
𝑖=0
Where 1(𝑥) is the indicator function (function defined on a set X that indicates membership
of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for
all elements of X not in A).
Is accuracy score enough? No. Accuracy is not the be-all and end-all model metric to use when
selecting the best model. When performing classification, one often wants to predict not only
the class label, but also the associated probability. This probability gives confidence on the
prediction.
6.8 Confusion Matrix
Compute confusion matrix to evaluate the accuracy of a classification.
By definition a confusion matrix 𝐶 is such that 𝐶𝑖,𝑗 is equal to the number of observations
known to be in group 𝑖 but predicted to be in group 𝑗.
Thus, in binary classification, the count of true negatives is 𝐶0,0, false negatives is 𝐶1,0, true
positives is 𝐶1,1 and false positives is 𝐶0,1.
6.9 Precision score
Precision is the probability that a (randomly selected) retrieved document is relevant.
The precision is intuitively the ability of the classifier not to label as positive a sample that is
negative. The best value is 1 and the worst value is 0.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒

Immediately, you can see that Precision talks about how precise/accurate your model is out
of those predicted positive, how many of them are actual positive.
Precision is a good measure to determine, when the costs of False Positive is high. For
instance, email spam detection. In email spam detection, a false positive means that an email
that is non-spam (actual negative) has been identified as spam (predicted spam). The email
user might lose important emails if the precision is not high for the spam detection model.
6.10 Recall score
Recall is the probability that a (randomly selected) relevant document is retrieved in a
search.
The recall is intuitively the ability of the classifier to find all the positive samples. The best
value is 1 and the worst value is 0.
𝑅𝑒𝑐𝑎𝑙𝑙 =
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒
So, Recall actually calculates how many of the Actual Positives our model capture through
labeling it as Positive (True Positive).
Applying the same understanding, we know that Recall shall be the model metric we use to
select our best model when there is a high cost associated with False Negative.
For instance, in fraud detection or sick patient detection.
If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted
Negative), the consequence can be very bad for the bank.
Similarly, in sick patient detection. If a sick patient (Actual Positive) goes through the test and
predicted as not sick (Predicted Negative). The cost associated with False Negative will be
extremely high if the sickness is contagious.

6.11 F1 score
F1 score, also known as balanced F-score or F-measure, can be interpreted as a weighted
average of the precision and recall, where an F1 score reaches its best value at 1 and worst
value at 0. The relative contribution of precision and recall to the F1 score are equal. In the
multi-class and multi-label case, this is the average of the F1 score of each class with weighting
depending on the average parameter.
𝐹1 = 2 ×
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
F1 Score is needed when you want to seek a balance between Precision and Recall. We have
previously seen that accuracy can be largely contributed by a large number of True Negatives
which in most business circumstances, we do not focus on much whereas False Negative and
False Positive usually has business costs (tangible & intangible) thus F1 Score might be a better
measure to use if we need to seek a balance between Precision and Recall and there is an
uneven class distribution (large number of Actual Negatives).
7. Final results
After reasoned and argued all my program and the procedures I followed to solve the
multiclass classification problem, I am going to write into this chapter results and
consequences of what I discovered and why.
All binary classifiers, applied on each family, have very good scores, including accuracy,
balanced accuracy, misclassification rate, recall, precision and f1.
Here are the results of some family, using a 3-fold cross validation with Random Forest.
Family #0: GinMaster
accuracy: [0.98825372 0.99451411 0.99137255]

balanced accuracy: [0.90555556 0.97180064 0.96067416]
misclassification: 8
recall: [0.98747063 0.98981191 0.9945098 ]
precision: [0.99146515 0.99446995 0.99369643]
f1: [0.98285934 0.99449994 0.9903826 ]
Family # 1: FakeInstaller
accuracy: [0.97137931 0.97059561 0.9776489 ]
recall: [0.97137931 0.97059561 0.9776489 ]
precision: [0.97057825 0.96662691 0.97764725]
f1: [0.97213876 0.97058114 0.9784326 ]
Family # 2: Plankton
accuracy: [0.9784326 0.9784326 0.9792163]
recall: [0.9784326 0.9784326 0.9853673 ]
precision: [0.97921701 0.97922078 0.98145464 ]
f1: [0.97921725 0.97606713 0.98473473 ]
…
Family # 23: Boxer
accuracy: [0.97373041 0.97529781 0.97373041]
recall: [0.97451411 0.97529781 0.97373041]
precision: [0.96905831 0.97257331 0.97165532]
f1: [0.9713867 0.97299564 0.97376927]

…
I am not going to print out results of SVM classifier for each family, because it is not so
important for our multiclass classification purposes, so it follows multiclass classification
problems results for both Random Forest and SVM after and before some consideration.
Then, we can come back to the multiclass classification to analysis its results.
The script I used to test the classifier implemented cross-validation and many other
techniques to avoid overfitting and to balance bias and variance. I was skeptical of the
relatively high precision, recall, and F1 score recorded of the single binary classifiers, but
looking through the script, I saw that the random seed for the cross-validation split was set at
some value in order to generate reproducible results. I changed the random seed and sure
enough, the performance of my model decreased. Therefore, I must have made the classic
mistake of overfitting on my training set for the given cross-validation random seed and
changed the seed again. Taking all of these precautions against overfitting, I had optimized
my model for a nonspecific set of data. In order to get a better indicator of the performance
of the Decision Tree model, I ran 10 tests with different random seeds and found the average
performance metrics. The final results for my model are summarized below:
Random Forest:
Average Accuracy Precision Recall F1
weighted 0.77 0.97 0.77 0.86
micro 0.76 0.76 0.76 0.76
macro 0.78 0.94 0.66 0.75
Linear SVC:
Average Accuracy Precision Recall F1
weighted 0.82 0.87 0.84 0.86
micro 0.81 0.81 0.81 0.81
macro 0.81 0.86 0.84 0.85

The results are good given the nature of the classification. We can instantly notice that SVM
has quite better results than Random Forest.
Due to the small size of the available data, even a minor change such as altering the random
seed when doing a train-test split can have significant effects on the performance of the
algorithm which must be accounted for by testing over many subsets and calculating the
average performance.
This project points out and highlights the importance and goodness of machine learning while
trying to classify complex, intricate and artificial objects such as a malware, an activity that is
almost always too difficult (if not simply impossible) to be done by humans independently.
Roberto Falconi

Malware analysis

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Malware analysis

Similar to Malware analysis (20)

More from Roberto Falconi

More from Roberto Falconi (14)

Recently uploaded

Recently uploaded (20)

Malware analysis