SlideShare a Scribd company logo
www.robertodaguarcino.com
Master of Engineering in Computer Science
Machine Learning for Malware Analysis
Bayesian approach for malwares classification by family
Roberto Falconi
www.robertodaguarcino.com
Summary
1. Abstract...............................................................................................................................3
2. Why I used Python..............................................................................................................3
3. Program setup.....................................................................................................................4
3.1 Windows setup............................................................................................................4
3.2 Linux or macOS setup..................................................................................................4
4. Dataset preparation............................................................................................................5
4.1 Drebin dataset .............................................................................................................5
4.2 Data pre-processing.....................................................................................................6
4.3 From nominal data to numeric data ...........................................................................7
4.4 Integer Encoding..........................................................................................................8
4.5 One-Hot Encoder.........................................................................................................9
5. Overfitting and underfitting avoidance ............................................................................10
5.1 Bias and Variance ......................................................................................................10
5.2 Training set and test set ............................................................................................10
5.3 Cross-validation .........................................................................................................11
6. The classification problem ................................................................................................12
6.1 Random Forest ..........................................................................................................12
6.2 SVM............................................................................................................................12
6.3 Classify malwares by family.......................................................................................12
6.4 Binary classifiers ........................................................................................................13
6.5 From binary to multiclass..........................................................................................14
6.6 One vs All and One vs One.........................................................................................14
6.7 Accuracy score...........................................................................................................15
6.8 Confusion Matrix .......................................................................................................16
6.9 Precision score...........................................................................................................16
6.10 Recall score ............................................................................................................17
6.11 F1 score..................................................................................................................18
7. Final results.......................................................................................................................18
www.robertodaguarcino.com
1. Abstract
Goal of this project is to understand the data contained in the DREBIN dataset to define a
classification problem for malware analysis (target function).
I decided to make a multiclass classification: I classified all malwares by family.
First, I classified malwares by family using binary classifiers for each family.
Second, I calculated the probability that a malware really belongs to some family and finally,
with One vs All and other methodologies, I came back to the multiclass problem to classify
the malware by family.
The evaluation procedure and the results are described in the following report and it
includes (but it is not limited to) dataset preprocessing, integer encoding, one-hot encoder,
overfitting avoidance using dataset splitting into training set and test set and cross
validation, the usage of classifiers such as Random Forest and Support Vector Machine (also
known as SVM or Support Vector Classification and SVC in Scikit-learn), and the coming back
from binary to multiclass problem using One vs All and finally the calculation of confusion
matrix, accuracy, misclassification rate, precision, recall, f1 and others (each described and
argued).
2. Why I used Python
First, I want to explain why I used Python to develop this project.
Python is popular in machine learning because of many inter-related reasons: it is simple,
elegant, consistent, and math-like.
Python code has been described as readable pseudocode. It is easy to pick up due to its
consistent syntax and the way it mirrors human language and/or their mathematical
counterparts.
www.robertodaguarcino.com
It is math-like in that some "objects" that are very much part of a mathematician's
vocabulary are part of the language without you having to install / import them, and they
resemble their equivalent mathematical counterparts. With carefully chosen
variable/function names, the code can be read like math or English, because you simply
don’t need to declare the type of a variable or to manually cast them.
The latter (much due to libraries such as Pandas, NumPy and Scikit-learn) is something one
will appreciate if he were to implement a machine learning algorithm of which the core is
likely just mathematical optimization.
3. Program setup
3.1 Windows setup
1. Download Visual Studio Code
https://code.visualstudio.com/Download
2. Install Python plugin for VS Code
https://marketplace.visualstudio.com/items?itemName=ms-python.python
3. Download and install Python 3.7.1 64 bit for Window
https://www.python.org/downloads/release/python-371
(It is important to mark "Add Python 3.7 to PATH" and "Disable max path length" at
the end of the setup)
4. Open VS Code terminal (ctrl + ò)
5. In the terminal, type the "pip install pandas" command
6. Again, type "pip install sklearn" command
7. Finally, "py <homeworkpath.py>" to run the code and read the printed results
described and argued in this report in the following chapters.
3.2 Linux or macOS setup
As the same way of the Windows setup, it is important to download Python at 64 bit (adding
it to the path and disabling the max path length), then, you can run it in an IDE like Visual
www.robertodaguarcino.com
Studio or just in the terminal using the same commands, but it is important to replace the
“pip” command with “sudo pip3” and “py” with “sudo python3”.
4. Dataset preparation
4.1 Drebin dataset
As the limited resources impede monitoring applications at run-time, DREBIN performs a
broad static analysis, gathering as many features of an application as possible. These features
are embedded in a joint vector space, such that typical patterns indicative for malware can be
automatically identified and used for explaining the decisions of our method.
In an evaluation with 123,453 applications and 5,560 malware samples DREBIN outperforms
several related approaches and detects 94% of the malware with few false alarms, where the
explanations provided for each detection reveal relevant properties of the detected malware.
DREBIN performs a broad static analysis, gathering as many features from an application’s
code and manifest as possible. These features are organized in sets of strings (such as
permissions, API calls and network addresses) and embedded in a joint vector space.
As an example, an application sending premium SMS messages is cast to a specific region in
the vector space associated with the corresponding permissions, intents and API calls.
www.robertodaguarcino.com
To foster research in the area of malware detection and to enable a comparison of different
approaches, we make the malicious Android applications used in our work as well as all
extracted feature sets available to other researchers in the DREBIN dataset.
4.2 Data pre-processing
It is possible to analyze malware and to classify them in families with a Machine Learning
program. In this case I chosen to use Python as explained above.
Data preprocessing involves data preparation and dividing the data into training and testing
sets.
Drebin sha256_family.csv has been loaded as the dataset and so it has been used to data pre-
processing and data analysis through some methodologies and techniques. The dataset has
two elements: sha256 string (file name) and its own family, but I take only families with more
than 20 elements, while all the other families (and their own elements) have been deleted.
This, because families with less then 20 elements can be statistically irrelevant, or worse.
www.robertodaguarcino.com
Features of the files were loaded into Design Matrix 𝑿 ∈ 𝑅 𝑁×𝐷
and the output in 𝒚 ∈ 𝑅 𝑁
,
𝑦𝑖 ∈ {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, 𝑒𝑡𝑐. }:
𝑿 = [
𝑥11 ⋯ 𝑥1𝑛
⋮ ⋱ ⋮
𝑥 𝑛1 ⋯ 𝑥 𝑛𝑛
]
𝒚 = (
𝑦1
⋮
𝑦𝑛
)
4.3 From nominal data to numeric data
Nominal data (or categorical data) are variables that contain label values rather than numeric
values. The number of possible values is often limited to a fixed set. Categorical variables are
often called nominal.
Some examples include:
A “pet” variable with the values: “dog” and “cat“.
A “color” variable with the values: “red“, “green” and “blue“.
A “place” variable with the values: “first”, “second” and “third“.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as a natural ordering.
The “place” variable above does have a natural ordering of values. This type of categorical
variable is called an ordinal variable.
What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.
www.robertodaguarcino.com
For example, a decision tree can be learned directly from categorical data with no data
transform required (this depends on the specific implementation).
Many machine learning algorithms cannot operate on label data directly. They require all
input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning
algorithms rather than hard limitations on the algorithms themselves.
This means that categorical data must be converted to a numerical form. If the categorical
variable is an output variable, you may also want to convert predictions by the model back
into a categorical form in order to present them or use them in some application.
How to Convert Categorical Data to Numerical Data?
This involves two steps: integer encoding and One-Hot Encoder.
4.4 Integer Encoding
As a first step, each unique category value is assigned an integer value.
For example, “red” is 1, “green” is 2, and “blue” is 3.
This is called a label encoding or an integer encoding and is easily reversible.
For some variables, this may be enough. The integer values have a natural ordered
relationship between each other and machine learning algorithms may be able to understand
and harness this relationship.
For example, ordinal variables like places would be a good example where a label/integer
encoding would be enough.
www.robertodaguarcino.com
4.5 One-Hot Encoder
For categorical variables where no such ordinal relationship exists, the integer encoding is not
enough.
In fact, using this encoding and allowing the model to assume a natural ordering between
categories may result in poor performance or unexpected results (predictions halfway
between categories).
In this case, a one-hot encoder can be applied to the integer representation. This is where the
integer encoded variable is removed and a new binary variable is added for each unique
integer value.
In the “color” variable example, there are 3 categories and therefore 3 binary variables are
needed. A “1” value is placed in the binary variable for the color and “0” values for the other
colors.
For example, for three different color elements:
red green blue
1 0 0
0 1 0
0 0 1
The binary variables are often called “dummy variables” in other fields, such as statistics.
In our case, I have applied this last technique on all the features, including the families and
the other inside the features’ vectors.
www.robertodaguarcino.com
5. Overfitting and underfitting avoidance
5.1 Bias and Variance
When evaluating a machine learning model, it is important to balance Bias and Variance.
High Bias refers to a scenario where the model is “underfitting” the dataset. This is bad
because the model is not presenting a very accurate or representative picture of the
relationship between inputs and predicted output and is often outputting high error.
High Variance represents the opposite scenario. In cases of High Variance or “overfitting”,
machine learning model is so accurate that it is perfectly fitted to your example dataset. While
this may seem like a good outcome, it is also a cause for concern, as such models often fail to
generalize to future datasets. So, while the model works well for existing data, it is not known
how well it will perform on other examples.
5.2 Training set and test set
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples that it has
just seen would have a perfect score but would fail to predict anything useful on yet-unseen
data. This situation is called overfitting. To avoid it, it is common practice when performing a
(supervised) machine learning experiment to hold out part of the available data as a test set
X_test, y_test. Note that the word “experiment” is not intended to denote academic use only,
because even in commercial settings machine learning usually starts out experimentally.
In scikit-learn a random split into training and test sets can be quickly computed with the
train_test_split helper function.
www.robertodaguarcino.com
5.3 Cross-validation
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting
that must be manually set for an SVM, there is still a risk of overfitting on the test set because
the parameters can be tweaked until the estimator performs optimally. This way, knowledge
about the test set can “leak” into the model and evaluation metrics no longer report on
generalization performance. To solve this problem, yet another part of the dataset can be held
out as a so-called “validation set”: training proceeds on the training set, after which evaluation
is done on the validation set, and when the experiment seems to be successful, final
evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number
of samples which can be used for learning the model, and the results can depend on a
particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set
should still be held out for final evaluation, but the validation set is no longer needed when
doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets
(other approaches are described below, but generally follow the same principles). The
following procedure is followed for each of the k “folds”:
A model is trained using of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set
to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values
computed in the loop. This approach can be computationally expensive but does not waste
too much data (as is the case when fixing an arbitrary validation set), which is a major
advantage in problems such as inverse inference where the number of samples is very small.
The simplest way to use cross-validation is to call the cross_val_score helper function on the
estimator and the dataset, but I have used also the cross_val_predict method in order to come
back from a binary to a multiclass problem.
www.robertodaguarcino.com
6. The classification problem
6.1 Random Forest
Random Forest is intrinsically suited for multiclass problems. It works well with a mixture of
numerical and categorical features. When features are on the various scales, it is also fine.
Roughly speaking, with Random Forest you can use data as they are. As a consequence, one-
hot encoder for categorical features is a must-do. Further, min-max or other scaling is highly
recommended at preprocessing step. At last, for a classification problem Random Forest
gives you probability of belonging to a class.
6.2 SVM
SVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
LinearSVC is an implementation of Support Vector Classification for the case of a linear kernel.
SVC implements the One vs One approach for multiclass classification. On the other hand,
LinearSVC implements One vs Rest multiclass strategy, thus training n class models.
Both One vs One and One vs Rest strategies are discussed in the next chapters.
One-hot encoder discretization, in a first time, lead us to multiple binary classifiers instead of
a single multiclass classifier.
𝒚 = 𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯
𝑤ℎ𝑒𝑟𝑒:
𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 ∈ {0,1}
𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 ∈ {0,1}
⋮
6.3 Classify malwares by family
Running binary classifier for each family return partial functions:
www.robertodaguarcino.com
𝑓𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝑓𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, …
With Scikit-learn predict method we got:
𝒚̃𝑖, 𝑤ℎ𝑒𝑟𝑒 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̃𝑖 = 𝑦̃1, … , 𝑦̃ 𝑛,
𝑦̃1, … , 𝑦̃ 𝑛 ∈ {0,1}
6.4 Binary classifiers
Scikit-learn provides methods in the sklearn.metrics module such as accuracy_score,
precision_score, recall_score and f1_score (discussed later). These methods print
misclassifications and accuracy, precision, recall and f1 scores of every binary classifier of
each family.
In our case, we have good results for each binary classifier (discussed and argued in the last
chapter), but it is not enough to understand if the results are so good as they seem to be,
because at this moment we are considering each single classifier and not them all, nor the
probability that an element belongs to the predicted family or to the others.
When performing classification, you often want to predict not only the class, but also the
associated probability. This probability gives you some kind of confidence on the prediction.
However, not all classifiers provide well-calibrated probabilities, some being over-confident
while others being under-confident. Thus, a separate calibration of predicted probabilities is
often desirable as a postprocessing.
In the next chapters, I will go deep the various scores analyzing them re-calculated using
predict_proba method instead of the predict method, normalizing the probability and
applying One vs All methodology.
www.robertodaguarcino.com
6.5 From binary to multiclass
Some metrics are essentially defined for binary classification tasks. In these cases, by default
only the positive label is evaluated, assuming by default that the positive class is labelled 1
(though this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass problems, the data is treated as a collection of binary
problems, one for each class. There are then several ways to average binary metric
calculations across the set of classes, each of which may be useful in some scenario. Where
available, Scikit-learn suggests us to select among these using the average parameter: "macro"
simply calculates the mean of the binary metrics, giving equal weight to each class. In
problems where infrequent classes are nonetheless important, macro-averaging may be a
means of highlighting their performance. On the other hand, the assumption that all classes
are equally important is often untrue, such that macro-averaging will over-emphasize the
typically low performance on an infrequent class; "weighted" accounts for class imbalance by
computing the average of binary metrics in which each class’s score is weighted by its
presence in the true data sample.
6.6 One vs All and One vs One
One vs All (also known as One vs Rest) strategy involves training a single classifier per class,
with the samples of that class as positive samples and all other samples as negatives. This
strategy requires the base classifiers to produce a real-valued confidence score for its
decision, rather than just a class label; discrete class labels alone can lead to ambiguities,
where multiple classes are predicted for a single sample. In the One vs One reduction, one
trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the
samples of a pair of classes from the original training set and must learn to distinguish these
two classes.
One vs All and One vs One methodologies bring us to the original problem of multiclass
classification.
www.robertodaguarcino.com
Making decisions means applying all classifiers to an unseen sample x and predicting the class
k for which the corresponding classifier reports the highest confidence score:
𝑦̂ = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑓𝑘(𝑥), 𝑘 ∈ {1, … , 𝐾}
Thanks to predict_proba method of Scikit-learn we have:
𝒚̂𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̂𝑖 = 𝑦̂1, … , 𝑦̂ 𝑛,
𝑦̂1, … , 𝑦̂ 𝑛 ∈ [0,1]
that represent this confidence score in a probability status.
To normalize the partial results (∑ 𝒚̅𝑖 𝑖
= 1):
𝒚̅𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … },
𝒚̅𝑖 =
𝒚̂𝑖
𝒚̂ 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒚̂ 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯
At last, we can apply the majority rule with a threshold of 0.50 confidence score:
𝐼𝑓 𝒚̅𝑖 > 0.50 ⇒ 𝒚̅𝑖 ∈ 𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … }
else, it goes to a misclassification.
Once proceeds the confidence score (or misclassifications) for every family, we can calculate
scores to understand if the classification can be good enough. In the next paragraphs there
will be discussed each used score and in the final chapter it will be analyzed my code’s results.
6.7 Accuracy score
The accuracy_score function computes the accuracy, either the fraction (default) or the count
(normalize=False) of correct predictions. If 𝑦̂𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖
is the corresponding true value, then the fraction of correct predictions over 𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 is
defined as
www.robertodaguarcino.com
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑦, 𝑦̂) =
1
𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠
∑ 1(𝑦̂𝑖 = 𝑦𝑖)
𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠−1
𝑖=0
Where 1(𝑥) is the indicator function (function defined on a set X that indicates membership
of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for
all elements of X not in A).
Is accuracy score enough? No. Accuracy is not the be-all and end-all model metric to use when
selecting the best model. When performing classification, one often wants to predict not only
the class label, but also the associated probability. This probability gives confidence on the
prediction.
6.8 Confusion Matrix
Compute confusion matrix to evaluate the accuracy of a classification.
By definition a confusion matrix 𝐶 is such that 𝐶𝑖,𝑗 is equal to the number of observations
known to be in group 𝑖 but predicted to be in group 𝑗.
Thus, in binary classification, the count of true negatives is 𝐶0,0, false negatives is 𝐶1,0, true
positives is 𝐶1,1 and false positives is 𝐶0,1.
6.9 Precision score
Precision is the probability that a (randomly selected) retrieved document is relevant.
The precision is intuitively the ability of the classifier not to label as positive a sample that is
negative. The best value is 1 and the worst value is 0.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
www.robertodaguarcino.com
Immediately, you can see that Precision talks about how precise/accurate your model is out
of those predicted positive, how many of them are actual positive.
Precision is a good measure to determine, when the costs of False Positive is high. For
instance, email spam detection. In email spam detection, a false positive means that an email
that is non-spam (actual negative) has been identified as spam (predicted spam). The email
user might lose important emails if the precision is not high for the spam detection model.
6.10 Recall score
Recall is the probability that a (randomly selected) relevant document is retrieved in a
search.
The recall is intuitively the ability of the classifier to find all the positive samples. The best
value is 1 and the worst value is 0.
𝑅𝑒𝑐𝑎𝑙𝑙 =
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒
So, Recall actually calculates how many of the Actual Positives our model capture through
labeling it as Positive (True Positive).
Applying the same understanding, we know that Recall shall be the model metric we use to
select our best model when there is a high cost associated with False Negative.
For instance, in fraud detection or sick patient detection.
If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted
Negative), the consequence can be very bad for the bank.
Similarly, in sick patient detection. If a sick patient (Actual Positive) goes through the test and
predicted as not sick (Predicted Negative). The cost associated with False Negative will be
extremely high if the sickness is contagious.
www.robertodaguarcino.com
6.11 F1 score
F1 score, also known as balanced F-score or F-measure, can be interpreted as a weighted
average of the precision and recall, where an F1 score reaches its best value at 1 and worst
value at 0. The relative contribution of precision and recall to the F1 score are equal. In the
multi-class and multi-label case, this is the average of the F1 score of each class with weighting
depending on the average parameter.
𝐹1 = 2 ×
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
F1 Score is needed when you want to seek a balance between Precision and Recall. We have
previously seen that accuracy can be largely contributed by a large number of True Negatives
which in most business circumstances, we do not focus on much whereas False Negative and
False Positive usually has business costs (tangible & intangible) thus F1 Score might be a better
measure to use if we need to seek a balance between Precision and Recall and there is an
uneven class distribution (large number of Actual Negatives).
7. Final results
After reasoned and argued all my program and the procedures I followed to solve the
multiclass classification problem, I am going to write into this chapter results and
consequences of what I discovered and why.
All binary classifiers, applied on each family, have very good scores, including accuracy,
balanced accuracy, misclassification rate, recall, precision and f1.
Here are the results of some family, using a 3-fold cross validation with Random Forest.
Family #0: GinMaster
accuracy: [0.98825372 0.99451411 0.99137255]
www.robertodaguarcino.com
balanced accuracy: [0.90555556 0.97180064 0.96067416]
misclassification: 8
recall: [0.98747063 0.98981191 0.9945098 ]
precision: [0.99146515 0.99446995 0.99369643]
f1: [0.98285934 0.99449994 0.9903826 ]
Family # 1: FakeInstaller
accuracy: [0.97137931 0.97059561 0.9776489 ]
balanced accuracy: [0.95636344 0.96700107 0.97701111]
misclassification: 8
recall: [0.97137931 0.97059561 0.9776489 ]
precision: [0.97057825 0.96662691 0.97764725]
f1: [0.97213876 0.97058114 0.9784326 ]
Family # 2: Plankton
accuracy: [0.9784326 0.9784326 0.9792163]
balanced accuracy: [0.97667272 0.97047187 0.97954628]
misclassification: 3
recall: [0.9784326 0.9784326 0.9853673 ]
precision: [0.97921701 0.97922078 0.98145464 ]
f1: [0.97921725 0.97606713 0.98473473 ]
…
Family # 23: Boxer
accuracy: [0.97373041 0.97529781 0.97373041]
balanced accuracy: [0.5 0.57142857 0.57064055]
misclassification: 4
recall: [0.97451411 0.97529781 0.97373041]
precision: [0.96905831 0.97257331 0.97165532]
f1: [0.9713867 0.97299564 0.97376927]
www.robertodaguarcino.com
…
I am not going to print out results of SVM classifier for each family, because it is not so
important for our multiclass classification purposes, so it follows multiclass classification
problems results for both Random Forest and SVM after and before some consideration.
Then, we can come back to the multiclass classification to analysis its results.
The script I used to test the classifier implemented cross-validation and many other
techniques to avoid overfitting and to balance bias and variance. I was skeptical of the
relatively high precision, recall, and F1 score recorded of the single binary classifiers, but
looking through the script, I saw that the random seed for the cross-validation split was set at
some value in order to generate reproducible results. I changed the random seed and sure
enough, the performance of my model decreased. Therefore, I must have made the classic
mistake of overfitting on my training set for the given cross-validation random seed and
changed the seed again. Taking all of these precautions against overfitting, I had optimized
my model for a nonspecific set of data. In order to get a better indicator of the performance
of the Decision Tree model, I ran 10 tests with different random seeds and found the average
performance metrics. The final results for my model are summarized below:
Random Forest:
Average Accuracy Precision Recall F1
weighted 0.77 0.97 0.77 0.86
micro 0.76 0.76 0.76 0.76
macro 0.78 0.94 0.66 0.75
Linear SVC:
Average Accuracy Precision Recall F1
weighted 0.82 0.87 0.84 0.86
micro 0.81 0.81 0.81 0.81
macro 0.81 0.86 0.84 0.85
www.robertodaguarcino.com
The results are good given the nature of the classification. We can instantly notice that SVM
has quite better results than Random Forest.
Due to the small size of the available data, even a minor change such as altering the random
seed when doing a train-test split can have significant effects on the performance of the
algorithm which must be accounted for by testing over many subsets and calculating the
average performance.
This project points out and highlights the importance and goodness of machine learning while
trying to classify complex, intricate and artificial objects such as a malware, an activity that is
almost always too difficult (if not simply impossible) to be done by humans independently.
Roberto Falconi

More Related Content

What's hot

414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
Jason Papapanagiotakis
 
Linkset quality (LWDM 2013)
Linkset quality (LWDM 2013)Linkset quality (LWDM 2013)
Linkset quality (LWDM 2013)
Riccardo Albertoni
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
UltraUploader
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarism
varsha_bhat
 
A novel approach based on topic
A novel approach based on topicA novel approach based on topic
A novel approach based on topic
csandit
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detector
UltraUploader
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R Programming
Nagarjun Kotyada
 
An Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning ApplicationsAn Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning Applications
butest
 
Half-automatic Compilable Source Code Recovery
Half-automatic Compilable Source Code RecoveryHalf-automatic Compilable Source Code Recovery
Half-automatic Compilable Source Code Recovery
Joxean Koret
 
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTSUSING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
ijseajournal
 
Plagiarism introduction
Plagiarism introductionPlagiarism introduction
Plagiarism introduction
Merin Paul
 
Tweet sentiment analysis
Tweet sentiment analysisTweet sentiment analysis
Tweet sentiment analysis
Anil Shrestha
 
Susi
Susi Susi
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-StudioComparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
PVS-Studio
 
++Matlab 14 sesiones
++Matlab 14 sesiones++Matlab 14 sesiones
++Matlab 14 sesiones
Rosemberth Rodriguez
 
Abstract
AbstractAbstract
Abstract
Suresh Prabhu
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
Marvin Bertin
 
150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind
Raghu Palakodety
 
Mining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs ViolationsMining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs Violations
Dongsun Kim
 
IRJET - Twitter Sentimental Analysis
IRJET -  	  Twitter Sentimental AnalysisIRJET -  	  Twitter Sentimental Analysis
IRJET - Twitter Sentimental Analysis
IRJET Journal
 

What's hot (20)

414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
414351_Iason_Papapanagiotakis-bousy_Iason_Papapanagiotakis_Thesis_2360661_357...
 
Linkset quality (LWDM 2013)
Linkset quality (LWDM 2013)Linkset quality (LWDM 2013)
Linkset quality (LWDM 2013)
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarism
 
A novel approach based on topic
A novel approach based on topicA novel approach based on topic
A novel approach based on topic
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detector
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R Programming
 
An Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning ApplicationsAn Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning Applications
 
Half-automatic Compilable Source Code Recovery
Half-automatic Compilable Source Code RecoveryHalf-automatic Compilable Source Code Recovery
Half-automatic Compilable Source Code Recovery
 
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTSUSING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS
 
Plagiarism introduction
Plagiarism introductionPlagiarism introduction
Plagiarism introduction
 
Tweet sentiment analysis
Tweet sentiment analysisTweet sentiment analysis
Tweet sentiment analysis
 
Susi
Susi Susi
Susi
 
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-StudioComparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio
 
++Matlab 14 sesiones
++Matlab 14 sesiones++Matlab 14 sesiones
++Matlab 14 sesiones
 
Abstract
AbstractAbstract
Abstract
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind
 
Mining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs ViolationsMining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs Violations
 
IRJET - Twitter Sentimental Analysis
IRJET -  	  Twitter Sentimental AnalysisIRJET -  	  Twitter Sentimental Analysis
IRJET - Twitter Sentimental Analysis
 

Similar to Malware analysis

IRJET - Automation in Python using Speech Recognition
IRJET -  	  Automation in Python using Speech RecognitionIRJET -  	  Automation in Python using Speech Recognition
IRJET - Automation in Python using Speech Recognition
IRJET Journal
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
Revolution Analytics
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Illia Ovchynnikov
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 
Malware Analysis Tips and Tricks.pdf
Malware Analysis Tips and Tricks.pdfMalware Analysis Tips and Tricks.pdf
Malware Analysis Tips and Tricks.pdf
Yushimon
 
nullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexitiesnullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexities
n|u - The Open Security Community
 
Raptor user manual3.0
Raptor user manual3.0Raptor user manual3.0
Raptor user manual3.0
Elizabeth Reyna
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
Derek Kane
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
Kamau Francis
 
IRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine LearningIRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine Learning
IRJET Journal
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
Vipul Divyanshu
 
Consumer centric api design v0.4.0
Consumer centric api design v0.4.0Consumer centric api design v0.4.0
Consumer centric api design v0.4.0
mustafa sarac
 
Django
DjangoDjango
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
Puppet
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
Puppet
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
Susan Johnston
 
Pyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdfPyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdf
Mattupallipardhu
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
Andrew Lowe
 
malware_detection_data_mining
malware_detection_data_miningmalware_detection_data_mining
malware_detection_data_mining
David Zivi
 
A Whirlwind Tour Of Python
A Whirlwind Tour Of PythonA Whirlwind Tour Of Python
A Whirlwind Tour Of Python
Asia Smith
 

Similar to Malware analysis (20)

IRJET - Automation in Python using Speech Recognition
IRJET -  	  Automation in Python using Speech RecognitionIRJET -  	  Automation in Python using Speech Recognition
IRJET - Automation in Python using Speech Recognition
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets Analysis
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Malware Analysis Tips and Tricks.pdf
Malware Analysis Tips and Tricks.pdfMalware Analysis Tips and Tricks.pdf
Malware Analysis Tips and Tricks.pdf
 
nullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexitiesnullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexities
 
Raptor user manual3.0
Raptor user manual3.0Raptor user manual3.0
Raptor user manual3.0
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
 
IRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine LearningIRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine Learning
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
 
Consumer centric api design v0.4.0
Consumer centric api design v0.4.0Consumer centric api design v0.4.0
Consumer centric api design v0.4.0
 
Django
DjangoDjango
Django
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
 
Pyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdfPyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdf
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
malware_detection_data_mining
malware_detection_data_miningmalware_detection_data_mining
malware_detection_data_mining
 
A Whirlwind Tour Of Python
A Whirlwind Tour Of PythonA Whirlwind Tour Of Python
A Whirlwind Tour Of Python
 

More from Roberto Falconi

River Trail: A Path to Parallelism in JavaScript
River Trail: A Path to Parallelism in JavaScriptRiver Trail: A Path to Parallelism in JavaScript
River Trail: A Path to Parallelism in JavaScript
Roberto Falconi
 
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWSBiometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Roberto Falconi
 
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWSBiometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Roberto Falconi
 
Black-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentationBlack-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentation
Roberto Falconi
 
SUOMI - UCD approach to build an IoT smart guide for spa
SUOMI - UCD approach to build an IoT smart guide for spaSUOMI - UCD approach to build an IoT smart guide for spa
SUOMI - UCD approach to build an IoT smart guide for spa
Roberto Falconi
 
Kalypso: She who hides. Encryption and decryption web app.
Kalypso: She who hides. Encryption and decryption web app.Kalypso: She who hides. Encryption and decryption web app.
Kalypso: She who hides. Encryption and decryption web app.
Roberto Falconi
 
Game ratings predictor
Game ratings predictorGame ratings predictor
Game ratings predictor
Roberto Falconi
 
Bb 8 run - a star wars video game
Bb 8 run - a star wars video gameBb 8 run - a star wars video game
Bb 8 run - a star wars video game
Roberto Falconi
 
Visual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in ItalyVisual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in Italy
Roberto Falconi
 
Visual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in ItalyVisual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in Italy
Roberto Falconi
 
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
Roberto Falconi
 
BB8 RUN - A Star Wars video game
BB8 RUN - A Star Wars video gameBB8 RUN - A Star Wars video game
BB8 RUN - A Star Wars video game
Roberto Falconi
 
Game Ratings Predictor - machine learning software to predict video games co...
Game Ratings Predictor  - machine learning software to predict video games co...Game Ratings Predictor  - machine learning software to predict video games co...
Game Ratings Predictor - machine learning software to predict video games co...
Roberto Falconi
 
House Temperature Monitoring using AWS IoT And Raspberry Pi
House Temperature Monitoring using AWS IoT And Raspberry PiHouse Temperature Monitoring using AWS IoT And Raspberry Pi
House Temperature Monitoring using AWS IoT And Raspberry Pi
Roberto Falconi
 

More from Roberto Falconi (14)

River Trail: A Path to Parallelism in JavaScript
River Trail: A Path to Parallelism in JavaScriptRiver Trail: A Path to Parallelism in JavaScript
River Trail: A Path to Parallelism in JavaScript
 
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWSBiometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
 
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWSBiometric Systems - Automate Video Streaming Analysis with Azure and AWS
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS
 
Black-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentationBlack-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentation
 
SUOMI - UCD approach to build an IoT smart guide for spa
SUOMI - UCD approach to build an IoT smart guide for spaSUOMI - UCD approach to build an IoT smart guide for spa
SUOMI - UCD approach to build an IoT smart guide for spa
 
Kalypso: She who hides. Encryption and decryption web app.
Kalypso: She who hides. Encryption and decryption web app.Kalypso: She who hides. Encryption and decryption web app.
Kalypso: She who hides. Encryption and decryption web app.
 
Game ratings predictor
Game ratings predictorGame ratings predictor
Game ratings predictor
 
Bb 8 run - a star wars video game
Bb 8 run - a star wars video gameBb 8 run - a star wars video game
Bb 8 run - a star wars video game
 
Visual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in ItalyVisual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in Italy
 
Visual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in ItalyVisual Analytics: Traffic Collisions in Italy
Visual Analytics: Traffic Collisions in Italy
 
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl...
 
BB8 RUN - A Star Wars video game
BB8 RUN - A Star Wars video gameBB8 RUN - A Star Wars video game
BB8 RUN - A Star Wars video game
 
Game Ratings Predictor - machine learning software to predict video games co...
Game Ratings Predictor  - machine learning software to predict video games co...Game Ratings Predictor  - machine learning software to predict video games co...
Game Ratings Predictor - machine learning software to predict video games co...
 
House Temperature Monitoring using AWS IoT And Raspberry Pi
House Temperature Monitoring using AWS IoT And Raspberry PiHouse Temperature Monitoring using AWS IoT And Raspberry Pi
House Temperature Monitoring using AWS IoT And Raspberry Pi
 

Recently uploaded

Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Envertis Software Solutions
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
pavan998932
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
Yara Milbes
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdfRevolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
Undress Baby
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 

Recently uploaded (20)

Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdfRevolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 

Malware analysis

  • 1. www.robertodaguarcino.com Master of Engineering in Computer Science Machine Learning for Malware Analysis Bayesian approach for malwares classification by family Roberto Falconi
  • 2. www.robertodaguarcino.com Summary 1. Abstract...............................................................................................................................3 2. Why I used Python..............................................................................................................3 3. Program setup.....................................................................................................................4 3.1 Windows setup............................................................................................................4 3.2 Linux or macOS setup..................................................................................................4 4. Dataset preparation............................................................................................................5 4.1 Drebin dataset .............................................................................................................5 4.2 Data pre-processing.....................................................................................................6 4.3 From nominal data to numeric data ...........................................................................7 4.4 Integer Encoding..........................................................................................................8 4.5 One-Hot Encoder.........................................................................................................9 5. Overfitting and underfitting avoidance ............................................................................10 5.1 Bias and Variance ......................................................................................................10 5.2 Training set and test set ............................................................................................10 5.3 Cross-validation .........................................................................................................11 6. The classification problem ................................................................................................12 6.1 Random Forest ..........................................................................................................12 6.2 SVM............................................................................................................................12 6.3 Classify malwares by family.......................................................................................12 6.4 Binary classifiers ........................................................................................................13 6.5 From binary to multiclass..........................................................................................14 6.6 One vs All and One vs One.........................................................................................14 6.7 Accuracy score...........................................................................................................15 6.8 Confusion Matrix .......................................................................................................16 6.9 Precision score...........................................................................................................16 6.10 Recall score ............................................................................................................17 6.11 F1 score..................................................................................................................18 7. Final results.......................................................................................................................18
  • 3. www.robertodaguarcino.com 1. Abstract Goal of this project is to understand the data contained in the DREBIN dataset to define a classification problem for malware analysis (target function). I decided to make a multiclass classification: I classified all malwares by family. First, I classified malwares by family using binary classifiers for each family. Second, I calculated the probability that a malware really belongs to some family and finally, with One vs All and other methodologies, I came back to the multiclass problem to classify the malware by family. The evaluation procedure and the results are described in the following report and it includes (but it is not limited to) dataset preprocessing, integer encoding, one-hot encoder, overfitting avoidance using dataset splitting into training set and test set and cross validation, the usage of classifiers such as Random Forest and Support Vector Machine (also known as SVM or Support Vector Classification and SVC in Scikit-learn), and the coming back from binary to multiclass problem using One vs All and finally the calculation of confusion matrix, accuracy, misclassification rate, precision, recall, f1 and others (each described and argued). 2. Why I used Python First, I want to explain why I used Python to develop this project. Python is popular in machine learning because of many inter-related reasons: it is simple, elegant, consistent, and math-like. Python code has been described as readable pseudocode. It is easy to pick up due to its consistent syntax and the way it mirrors human language and/or their mathematical counterparts.
  • 4. www.robertodaguarcino.com It is math-like in that some "objects" that are very much part of a mathematician's vocabulary are part of the language without you having to install / import them, and they resemble their equivalent mathematical counterparts. With carefully chosen variable/function names, the code can be read like math or English, because you simply don’t need to declare the type of a variable or to manually cast them. The latter (much due to libraries such as Pandas, NumPy and Scikit-learn) is something one will appreciate if he were to implement a machine learning algorithm of which the core is likely just mathematical optimization. 3. Program setup 3.1 Windows setup 1. Download Visual Studio Code https://code.visualstudio.com/Download 2. Install Python plugin for VS Code https://marketplace.visualstudio.com/items?itemName=ms-python.python 3. Download and install Python 3.7.1 64 bit for Window https://www.python.org/downloads/release/python-371 (It is important to mark "Add Python 3.7 to PATH" and "Disable max path length" at the end of the setup) 4. Open VS Code terminal (ctrl + ò) 5. In the terminal, type the "pip install pandas" command 6. Again, type "pip install sklearn" command 7. Finally, "py <homeworkpath.py>" to run the code and read the printed results described and argued in this report in the following chapters. 3.2 Linux or macOS setup As the same way of the Windows setup, it is important to download Python at 64 bit (adding it to the path and disabling the max path length), then, you can run it in an IDE like Visual
  • 5. www.robertodaguarcino.com Studio or just in the terminal using the same commands, but it is important to replace the “pip” command with “sudo pip3” and “py” with “sudo python3”. 4. Dataset preparation 4.1 Drebin dataset As the limited resources impede monitoring applications at run-time, DREBIN performs a broad static analysis, gathering as many features of an application as possible. These features are embedded in a joint vector space, such that typical patterns indicative for malware can be automatically identified and used for explaining the decisions of our method. In an evaluation with 123,453 applications and 5,560 malware samples DREBIN outperforms several related approaches and detects 94% of the malware with few false alarms, where the explanations provided for each detection reveal relevant properties of the detected malware. DREBIN performs a broad static analysis, gathering as many features from an application’s code and manifest as possible. These features are organized in sets of strings (such as permissions, API calls and network addresses) and embedded in a joint vector space. As an example, an application sending premium SMS messages is cast to a specific region in the vector space associated with the corresponding permissions, intents and API calls.
  • 6. www.robertodaguarcino.com To foster research in the area of malware detection and to enable a comparison of different approaches, we make the malicious Android applications used in our work as well as all extracted feature sets available to other researchers in the DREBIN dataset. 4.2 Data pre-processing It is possible to analyze malware and to classify them in families with a Machine Learning program. In this case I chosen to use Python as explained above. Data preprocessing involves data preparation and dividing the data into training and testing sets. Drebin sha256_family.csv has been loaded as the dataset and so it has been used to data pre- processing and data analysis through some methodologies and techniques. The dataset has two elements: sha256 string (file name) and its own family, but I take only families with more than 20 elements, while all the other families (and their own elements) have been deleted. This, because families with less then 20 elements can be statistically irrelevant, or worse.
  • 7. www.robertodaguarcino.com Features of the files were loaded into Design Matrix 𝑿 ∈ 𝑅 𝑁×𝐷 and the output in 𝒚 ∈ 𝑅 𝑁 , 𝑦𝑖 ∈ {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, 𝑒𝑡𝑐. }: 𝑿 = [ 𝑥11 ⋯ 𝑥1𝑛 ⋮ ⋱ ⋮ 𝑥 𝑛1 ⋯ 𝑥 𝑛𝑛 ] 𝒚 = ( 𝑦1 ⋮ 𝑦𝑛 ) 4.3 From nominal data to numeric data Nominal data (or categorical data) are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Categorical variables are often called nominal. Some examples include: A “pet” variable with the values: “dog” and “cat“. A “color” variable with the values: “red“, “green” and “blue“. A “place” variable with the values: “first”, “second” and “third“. Each value represents a different category. Some categories may have a natural relationship to each other, such as a natural ordering. The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable. What is the Problem with Categorical Data? Some algorithms can work with categorical data directly.
  • 8. www.robertodaguarcino.com For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation). Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric. In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves. This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application. How to Convert Categorical Data to Numerical Data? This involves two steps: integer encoding and One-Hot Encoder. 4.4 Integer Encoding As a first step, each unique category value is assigned an integer value. For example, “red” is 1, “green” is 2, and “blue” is 3. This is called a label encoding or an integer encoding and is easily reversible. For some variables, this may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship. For example, ordinal variables like places would be a good example where a label/integer encoding would be enough.
  • 9. www.robertodaguarcino.com 4.5 One-Hot Encoder For categorical variables where no such ordinal relationship exists, the integer encoding is not enough. In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories). In this case, a one-hot encoder can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value. In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors. For example, for three different color elements: red green blue 1 0 0 0 1 0 0 0 1 The binary variables are often called “dummy variables” in other fields, such as statistics. In our case, I have applied this last technique on all the features, including the families and the other inside the features’ vectors.
  • 10. www.robertodaguarcino.com 5. Overfitting and underfitting avoidance 5.1 Bias and Variance When evaluating a machine learning model, it is important to balance Bias and Variance. High Bias refers to a scenario where the model is “underfitting” the dataset. This is bad because the model is not presenting a very accurate or representative picture of the relationship between inputs and predicted output and is often outputting high error. High Variance represents the opposite scenario. In cases of High Variance or “overfitting”, machine learning model is so accurate that it is perfectly fitted to your example dataset. While this may seem like a good outcome, it is also a cause for concern, as such models often fail to generalize to future datasets. So, while the model works well for existing data, it is not known how well it will perform on other examples. 5.2 Training set and test set Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally. In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function.
  • 11. www.robertodaguarcino.com 5.3 Cross-validation When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”: A model is trained using of the folds as training data; the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy). The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small. The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset, but I have used also the cross_val_predict method in order to come back from a binary to a multiclass problem.
  • 12. www.robertodaguarcino.com 6. The classification problem 6.1 Random Forest Random Forest is intrinsically suited for multiclass problems. It works well with a mixture of numerical and categorical features. When features are on the various scales, it is also fine. Roughly speaking, with Random Forest you can use data as they are. As a consequence, one- hot encoder for categorical features is a must-do. Further, min-max or other scaling is highly recommended at preprocessing step. At last, for a classification problem Random Forest gives you probability of belonging to a class. 6.2 SVM SVC and LinearSVC are classes capable of performing multi-class classification on a dataset. LinearSVC is an implementation of Support Vector Classification for the case of a linear kernel. SVC implements the One vs One approach for multiclass classification. On the other hand, LinearSVC implements One vs Rest multiclass strategy, thus training n class models. Both One vs One and One vs Rest strategies are discussed in the next chapters. One-hot encoder discretization, in a first time, lead us to multiple binary classifiers instead of a single multiclass classifier. 𝒚 = 𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯ 𝑤ℎ𝑒𝑟𝑒: 𝒘 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 ∈ {0,1} 𝒘 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 ∈ {0,1} ⋮ 6.3 Classify malwares by family Running binary classifier for each family return partial functions:
  • 13. www.robertodaguarcino.com 𝑓𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝑓𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … With Scikit-learn predict method we got: 𝒚̃𝑖, 𝑤ℎ𝑒𝑟𝑒 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … }, 𝒚̃𝑖 = 𝑦̃1, … , 𝑦̃ 𝑛, 𝑦̃1, … , 𝑦̃ 𝑛 ∈ {0,1} 6.4 Binary classifiers Scikit-learn provides methods in the sklearn.metrics module such as accuracy_score, precision_score, recall_score and f1_score (discussed later). These methods print misclassifications and accuracy, precision, recall and f1 scores of every binary classifier of each family. In our case, we have good results for each binary classifier (discussed and argued in the last chapter), but it is not enough to understand if the results are so good as they seem to be, because at this moment we are considering each single classifier and not them all, nor the probability that an element belongs to the predicted family or to the others. When performing classification, you often want to predict not only the class, but also the associated probability. This probability gives you some kind of confidence on the prediction. However, not all classifiers provide well-calibrated probabilities, some being over-confident while others being under-confident. Thus, a separate calibration of predicted probabilities is often desirable as a postprocessing. In the next chapters, I will go deep the various scores analyzing them re-calculated using predict_proba method instead of the predict method, normalizing the probability and applying One vs All methodology.
  • 14. www.robertodaguarcino.com 6.5 From binary to multiclass Some metrics are essentially defined for binary classification tasks. In these cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though this may be configurable through the pos_label parameter). In extending a binary metric to multiclass problems, the data is treated as a collection of binary problems, one for each class. There are then several ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, Scikit-learn suggests us to select among these using the average parameter: "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class; "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample. 6.6 One vs All and One vs One One vs All (also known as One vs Rest) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample. In the One vs One reduction, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. One vs All and One vs One methodologies bring us to the original problem of multiclass classification.
  • 15. www.robertodaguarcino.com Making decisions means applying all classifiers to an unseen sample x and predicting the class k for which the corresponding classifier reports the highest confidence score: 𝑦̂ = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑓𝑘(𝑥), 𝑘 ∈ {1, … , 𝐾} Thanks to predict_proba method of Scikit-learn we have: 𝒚̂𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … }, 𝒚̂𝑖 = 𝑦̂1, … , 𝑦̂ 𝑛, 𝑦̂1, … , 𝑦̂ 𝑛 ∈ [0,1] that represent this confidence score in a probability status. To normalize the partial results (∑ 𝒚̅𝑖 𝑖 = 1): 𝒚̅𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … }, 𝒚̅𝑖 = 𝒚̂𝑖 𝒚̂ 𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛 + 𝒚̂ 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟 + ⋯ At last, we can apply the majority rule with a threshold of 0.50 confidence score: 𝐼𝑓 𝒚̅𝑖 > 0.50 ⇒ 𝒚̅𝑖 ∈ 𝑖, 𝑖 = {𝑃𝑙𝑎𝑛𝑘𝑡𝑜𝑛, 𝐹𝑎𝑘𝑒𝐼𝑛𝑠𝑡𝑎𝑙𝑙𝑒𝑟, … } else, it goes to a misclassification. Once proceeds the confidence score (or misclassifications) for every family, we can calculate scores to understand if the classification can be good enough. In the next paragraphs there will be discussed each used score and in the final chapter it will be analyzed my code’s results. 6.7 Accuracy score The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions. If 𝑦̂𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the fraction of correct predictions over 𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 is defined as
  • 16. www.robertodaguarcino.com 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑦, 𝑦̂) = 1 𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 ∑ 1(𝑦̂𝑖 = 𝑦𝑖) 𝑛 𝑠𝑎𝑚𝑝𝑙𝑒𝑠−1 𝑖=0 Where 1(𝑥) is the indicator function (function defined on a set X that indicates membership of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for all elements of X not in A). Is accuracy score enough? No. Accuracy is not the be-all and end-all model metric to use when selecting the best model. When performing classification, one often wants to predict not only the class label, but also the associated probability. This probability gives confidence on the prediction. 6.8 Confusion Matrix Compute confusion matrix to evaluate the accuracy of a classification. By definition a confusion matrix 𝐶 is such that 𝐶𝑖,𝑗 is equal to the number of observations known to be in group 𝑖 but predicted to be in group 𝑗. Thus, in binary classification, the count of true negatives is 𝐶0,0, false negatives is 𝐶1,0, true positives is 𝐶1,1 and false positives is 𝐶0,1. 6.9 Precision score Precision is the probability that a (randomly selected) retrieved document is relevant. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0. 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
  • 17. www.robertodaguarcino.com Immediately, you can see that Precision talks about how precise/accurate your model is out of those predicted positive, how many of them are actual positive. Precision is a good measure to determine, when the costs of False Positive is high. For instance, email spam detection. In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted spam). The email user might lose important emails if the precision is not high for the spam detection model. 6.10 Recall score Recall is the probability that a (randomly selected) relevant document is retrieved in a search. The recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0. 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 So, Recall actually calculates how many of the Actual Positives our model capture through labeling it as Positive (True Positive). Applying the same understanding, we know that Recall shall be the model metric we use to select our best model when there is a high cost associated with False Negative. For instance, in fraud detection or sick patient detection. If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequence can be very bad for the bank. Similarly, in sick patient detection. If a sick patient (Actual Positive) goes through the test and predicted as not sick (Predicted Negative). The cost associated with False Negative will be extremely high if the sickness is contagious.
  • 18. www.robertodaguarcino.com 6.11 F1 score F1 score, also known as balanced F-score or F-measure, can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst value at 0. The relative contribution of precision and recall to the F1 score are equal. In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter. 𝐹1 = 2 × 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙 F1 Score is needed when you want to seek a balance between Precision and Recall. We have previously seen that accuracy can be largely contributed by a large number of True Negatives which in most business circumstances, we do not focus on much whereas False Negative and False Positive usually has business costs (tangible & intangible) thus F1 Score might be a better measure to use if we need to seek a balance between Precision and Recall and there is an uneven class distribution (large number of Actual Negatives). 7. Final results After reasoned and argued all my program and the procedures I followed to solve the multiclass classification problem, I am going to write into this chapter results and consequences of what I discovered and why. All binary classifiers, applied on each family, have very good scores, including accuracy, balanced accuracy, misclassification rate, recall, precision and f1. Here are the results of some family, using a 3-fold cross validation with Random Forest. Family #0: GinMaster accuracy: [0.98825372 0.99451411 0.99137255]
  • 19. www.robertodaguarcino.com balanced accuracy: [0.90555556 0.97180064 0.96067416] misclassification: 8 recall: [0.98747063 0.98981191 0.9945098 ] precision: [0.99146515 0.99446995 0.99369643] f1: [0.98285934 0.99449994 0.9903826 ] Family # 1: FakeInstaller accuracy: [0.97137931 0.97059561 0.9776489 ] balanced accuracy: [0.95636344 0.96700107 0.97701111] misclassification: 8 recall: [0.97137931 0.97059561 0.9776489 ] precision: [0.97057825 0.96662691 0.97764725] f1: [0.97213876 0.97058114 0.9784326 ] Family # 2: Plankton accuracy: [0.9784326 0.9784326 0.9792163] balanced accuracy: [0.97667272 0.97047187 0.97954628] misclassification: 3 recall: [0.9784326 0.9784326 0.9853673 ] precision: [0.97921701 0.97922078 0.98145464 ] f1: [0.97921725 0.97606713 0.98473473 ] … Family # 23: Boxer accuracy: [0.97373041 0.97529781 0.97373041] balanced accuracy: [0.5 0.57142857 0.57064055] misclassification: 4 recall: [0.97451411 0.97529781 0.97373041] precision: [0.96905831 0.97257331 0.97165532] f1: [0.9713867 0.97299564 0.97376927]
  • 20. www.robertodaguarcino.com … I am not going to print out results of SVM classifier for each family, because it is not so important for our multiclass classification purposes, so it follows multiclass classification problems results for both Random Forest and SVM after and before some consideration. Then, we can come back to the multiclass classification to analysis its results. The script I used to test the classifier implemented cross-validation and many other techniques to avoid overfitting and to balance bias and variance. I was skeptical of the relatively high precision, recall, and F1 score recorded of the single binary classifiers, but looking through the script, I saw that the random seed for the cross-validation split was set at some value in order to generate reproducible results. I changed the random seed and sure enough, the performance of my model decreased. Therefore, I must have made the classic mistake of overfitting on my training set for the given cross-validation random seed and changed the seed again. Taking all of these precautions against overfitting, I had optimized my model for a nonspecific set of data. In order to get a better indicator of the performance of the Decision Tree model, I ran 10 tests with different random seeds and found the average performance metrics. The final results for my model are summarized below: Random Forest: Average Accuracy Precision Recall F1 weighted 0.77 0.97 0.77 0.86 micro 0.76 0.76 0.76 0.76 macro 0.78 0.94 0.66 0.75 Linear SVC: Average Accuracy Precision Recall F1 weighted 0.82 0.87 0.84 0.86 micro 0.81 0.81 0.81 0.81 macro 0.81 0.86 0.84 0.85
  • 21. www.robertodaguarcino.com The results are good given the nature of the classification. We can instantly notice that SVM has quite better results than Random Forest. Due to the small size of the available data, even a minor change such as altering the random seed when doing a train-test split can have significant effects on the performance of the algorithm which must be accounted for by testing over many subsets and calculating the average performance. This project points out and highlights the importance and goodness of machine learning while trying to classify complex, intricate and artificial objects such as a malware, an activity that is almost always too difficult (if not simply impossible) to be done by humans independently. Roberto Falconi