Malware analysis on android using supervised machine learning techniques

2018 International Conference on Intelligent Information Technology (ICIIT 2018)
1
Date : February, 2018
Location : Vietnam
Presented by:
Md. Shohel Rana
Instructed by:
Andrew H. Sung
MALWARE ANALYSIS ON ANDROID USING SUPERVISED MACHINE
LEARNING TECHNIQUES

Contents
 Motivation and Contribution
 Android with Security Mechanism
 Malware
 Techniques for threat analysis
 Static analysis, Dynamic analysis, Machine Learning
 Supervised Machine Learning
 Machine Learning – Malware analysis
 Work
 Methodology, Experiment, Results, Discussion
 Conclusion and future work
 References
2

Motivation and Contribution
 Android, an OS has over one billion active users for all their mobile
devices, with a market impact that is influencing an increase in the
amount of information obtained from different users, facts that have
motivated the development of malware by cybercriminals
 To solve the problems caused by malware, Android implements a
different architecture and security controls, such as unique user ID
(UID) for each application, system permissions, and its distribution
platform Google Play
 There are many ways to violate that protection, and how the
complexity for creating a new solution are increased while
cybercriminals improve their skills to develop malware
3

Motivation and Contribution (Cont.)
 The community of developers and researchers has been developing
alternatives aimed at improving the level of safety where some
solutions have been proposed: analysis techniques, frameworks,
sandboxes, and systems security
 One of the most promising way is the implementation of artificial
intelligence solutions for malware analysis
 This work proposed a module that implements a static analysis
framework with six machine learning algorithms for detecting
malware on Android
4

Android with Security Mechanism
 Open source based on the Linux kernel based operating system for mobile
devices having more than one billion active users [1][2]
 Applications are compiled into an Android Application Package (APK) file
 Source code (.dex files)
 AndroidManifest.xml
 Provides information on the characteristics and security settings of each application
[4].
 API permissions
 Activities
 Services
 Content providers
 Broadcast receivers
5

Android with Security Mechanism
(cont.) Security mechanisms
 Kernel-level sandbox environment to prevent file-system access
 Application level permissions API [3]
 Security tools at the application development level
 Google Play digital distribution platform
6

Malware
 Malicious software, is development that is projected to harm users who use
it
 Trojan horse: A Trojan horse or Trojan is a type of malware that is often disguised as legitimate software
 Spyware: Designed to gather data from a computer or other device and forward it to a third party without the
consent or knowledge of the user
 Worms: A computer worm is a standalone malware computer program that replicates itself in order to spread to
other computers. Often, it uses a computer network to spread itself, relying on security failures on the target
computer to access it
 Root-kit: A rootkit is a program designed to provide hackers with administrative access to your computer
without you knowing. It can be used to remotely control a device. A rootkit hides its presence on a PC, usually
within some of the lower layers of the operating system
 Ransomware: Ransomware is a subset of malware in which the data on a victim's computer is locked, typically
by encryption, and payment is demanded before the ransomed data is decrypted and access returned to the
victim
 Modify the truthful action of the device
 Gain sensitive information from users
 First Trojan found in Android in 2010 [5]
7

Malware (cont.)
 The bellow figure shows the number of attacks blocked by Kaspersky
lab solutions in 2015 [6]
8

Techniques for Threat Analysis
 According to project ‘SafeCandy’ [7]
 Static analysis
 Dynamic analysis
 Artificial intelligence - machine learning
9

Techniques for Threat Analysis (cont.)
 Static analysis:
 It is a technique that evaluates malicious behaviors in source code, data or
binary files without running the application directly [8]
 Its complexity has increased due to the experience acquired by cybercriminals
in the development of applications and it has been shown that it is possible to
avoid it from obfuscation techniques [9]
 Dynamic analysis:
 They are methods that study the behavior of malware in execution through
gesture simulation; in this technique we analyze the running processes, the user
interface, network connections, sockets opening, among others [10]
 On the other hand, already exist techniques that avoid the process realized by
the dynamic analysis, where the malware has the ability to detect environments
sandbox and to stop its malicious behavior [11]
10

Techniques for Threat Analysis (cont.)
 Artificial intelligence – machine learning
 Supervised learning
 Unsupervised learning
11

Supervised Machine Learning
 Builds a model to make predictions using a known set of input
data and responses by training a model to make predictions for
the response to new data
 Supervised machine learning splits into two broad categories [12]-
[14]:
 Classification (When the algorithm uses the data to predict a category),
 Regression (When a value is used to predict a continuous measurement for
an observation)
 Common classification algorithms
 Support vector machines (SVM), Neural networks, Naïve Bayes classifier,
Decision trees, Discriminant analysis, K-Nearest neighbors (k-NN)
12

Supervised Machine Learning (Cont.)
 Common regression algorithms:
 Linear regression, Nonlinear regression, Generalized linear models, Decision
trees, Neural networks
 Steps in Supervised Learning
1. Determine the type of training
2. Gather a training set
3. (i) Determine the input feature representation of the learned function and
structure
(ii) Determine the structure of the learned function and corresponding
learning algorithm
4. Complete the design
5. Evaluate the accuracy of the learned function
13

Machine Learning - Malware
 Both static and dynamic analysis assist to obtain the data which
will then be processed by the algorithms of Machine Learning
 The datasets must be balanced
 Most of the studies use the malware dataset of the ‘MalGenome’
project [13] and a crawler [14] for the download of benign
applications
14

Methodology
In our methodology section we build a model using a study of static
analysis of the application configuration permissions containing train
and test datasets, assessment standards, the outcomes of the
experimentation of these algorithms
15
Table1. Dataset Description
Android Malware Training Set Test Set
Value Count Percent Count Percent Count Percent
0 199 50.00% 116 48.54% 83 52.20%
1 199 50.00% 123 51.46% 76 47.80%

Methodology (cont.)
The proposed framework is divided into five steps:
1.Gather a training set: A total of 558 APKs were collected where 279 low-privilege
applications of the open source dataset and a random selection of 279
‘MalGenome’ malware.
2.Extraction: To attain the AndroidManifest.xml file from the collected APK we used
the ‘ApkTool 2.3.0’ tool (Released-21 Sep 2017)
3.Features vector: We used MATLAB (2017a) for the development purpose with
creating the binary vectors by accessing of every application’s permission, and a
label of application classification was added to determine whether the
application as benign or malicious. Each application is allotted a vector defined
by V = (R1, R2, ..., R330, C) where,
16
�� = {0,�� 𝑎�𝑜� ℎ𝑒� 𝑐𝑎𝑠𝑒
1,if the application is malicious
�� = {0,�� 𝑎�𝑜� ℎ𝑒� 𝑐𝑎𝑠𝑒
1,�� 𝑑𝑒�𝑒𝑐�𝑒𝑑 𝑎� 𝑎𝑐𝑐𝑒𝑠𝑠 𝑝𝑒�𝑚�𝑠𝑠�𝑜�

Methodology (cont.)
4. Training and testing: For our experiment, we divided the dataset into two
sections where 60% of the data used for training and 40% of the data, selected
randomly, for testing purpose.
5. Classification: After training and testing using these supervised learning
algorithms we reached at position with classifying the applications as benign or
malicious shown in below figure.
Fig. 1. A static analysis framework to detect malware on Android
17

Measurement Metrics
 A confusion matrix is a matrix which contains information about actual and
predicted classifications in Table 2 done by a classification to measure the
algorithm performance using the matrix data [18] - [19]
 Accuracy (AC) is the proportion of the total number of corrected
predictions. Overall, how often is the classifier correct
 Misclassification Rate (1-Accuracy) describe the overall, how often is it
wrong
18
𝐴𝑐𝑐𝑢�𝑎𝑐� =
𝑁𝑜. 𝑜� 𝑐𝑜��𝑒𝑐�𝑙� 𝑐𝑙𝑎𝑠𝑠��𝑒𝑑 𝑑𝑎�𝑎 (𝑇� + 𝑇𝑁)
𝑇𝑜�𝑎𝑙
𝑚�𝑠𝑐𝑙𝑎𝑠𝑠��𝑐𝑎��𝑜� =
𝑁𝑜. 𝑜� 𝑚�𝑠𝑐𝑙𝑎𝑠𝑠��𝑒𝑑 𝑑𝑎�𝑎 (𝐹� + 𝐹𝑁)
𝑇𝑜�𝑎𝑙
Table2. Confusion Matrix
Actual class
Positive Negative
Predicted Class
Positive True Positive (TP) False Positive (FP)
Negative False Negative (FN) True Negative (TN)

Measurement Metrics (CONT.)
 Precision (P) is the proportion of the correctly predicted positive
cases determined by
 The recall or true positive rate (TP) is the proportion of the correctly
identified positive cases defined by
 f1- Score is a weighted average of the True Positive (TP) rate (recall)
and Precision (P)
 ROC Curve is a graph to summarize the performance of the classifier
over all probable thresholds generated by plotting the True Positive
(TP) Rate in Y-axis against the False Positive (FP) Rate in X-axis
19
�𝑒𝑐𝑎𝑙𝑙 =
𝑇 �
𝑇 � + 𝐹 𝑁
𝑝�𝑒𝑐�𝑠�𝑜� =
𝑇 �
𝑇 � + 𝐹 �

Experiment and result
 Technologies used
 The experiments are done using MATLAB (2017a) on the machine where the
System Model was MacBookPr011.5, Processor was Intel(R) Core(TM) i7-4870HQ
CPU @ 2.50GHz (8 CPUs), ~2.5GHz 64-bit PC.
20

Experiment and result (Cont.)
Table 3. Result of the performance of Supervised Machine Learning algorithms
Algorithms
Precision Recall Misclassification rate
Accuracy0 1 0 1 0 1
NB 0 .8 5 0 .8 7 0 .8 2 0 .8 3 0 .1 5 0 .1 7 0 .8 4
DA 0 .8 1 0 .8 4 0 .8 9 0 .8 9 0 .1 4 0 .1 5 0 .8 6
kNN 0 .9 6 0 .9 6 0 .9 7 0 .9 6 0 .0 4 0 .0 4 0 .9 6
DT 0 .9 3 0 .9 3 0 .9 3 0 .9 2 0 .0 6 0 .0 5 0 .9 4
NN 0 .8 7 0 .9 0 0 .8 5 0 .8 9 0 .1 2 0 .1 0 0 .8 9
SVM 0 .9 1 0 .9 3 0 .9 2 0 .9 4 0 .0 8 0 .0 6 0 .9 3
21

Experiment and result (Cont.)
22
Fig. 2. ROC Curve Fig. 3. Misclassification rate

Result and Discussion
From the results we get, Naive Bayes and Discriminant analysis give the worst
results both in the classification for benign and malicious behavior and the
performances of them are 84% and 86% respectively. The neural network
algorithm gives the average result with 89% performance. Decision Tree and
SVM algorithms provide almost similar results with 94% and 93% performances
respectively. Finally, the algorithm that presented the best results for our
dataset was K-Nearest Neighbors both in the classification (96% for benign
software classification and 96% for malware) and overall performance is 96%.
The above results can be seen through the ROC curve (Fig. 2) and the
misclassification curve (Fig. 3).
23

Conclusion and future work
 In this article we propose the assessment numerous supervised
machine learning algorithm to detect malware on Android from their
access permissions. As future work suggestions, we will try to do the
following:
 Assess trained classifiers in contradiction of another set of applications with over
privileges.
 Study and Train other artificial intelligence techniques.
 Implement a system that will allow the algorithm to learn from wicked
classifications.
 Discover other features of Android for the training and testing
24

References
[1] Pichai, S. Google I/O 2014 - Keynote [video. 6:43m], https://www.google.com/events/io. June 2014.
[2] Elenkov, N. Android Security Internals: An In-depth Guide to Android's Security Architecture. No Starch Press.
October 2014.
[3] Drake, J. J., Lanier, Z., Mulliner, C., Fora, P. O., Ridley, S. A., & Wicherski, G. Android Hacker's Handbook. John
Wiley & Sons, Indianapolis. March 2014.
[4] Peiravian, N., & Zhu, X. Machine learning for android malware detection using permission and api calls. In Tools
with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on (pp. 300-305). IEEE. November 2013.
[5] Plaza, E. Kaspersky detecta el primer troyano para Android - Wayerless.
http://www.wayerless.com/2010/08/kaspersky-detecta-el-primer-troyano-para-android. Agust 2010.
[6] Unuchek, R., & Chebyshev, V. Mobile malware evolution 2015. https://securelist.com/analysis/kaspersky-
security-bulletin/73839/mobile-malware-evolution-2015/. Feb 2016.
[7] Londoño, S., Urcuqui, C. C., Amaya, M. F., Gómez, J., & Cadavid, A. N. (2015). SafeCandy: System for security,
analysis and validation in Android.Sistemas y Telemática, 13(35), 89-102.
[8] Batyuk, L., Herpich, M., Camtepe, S. A., Raddatz, K., Schmidt, A., & Albayrak, S. Using static analysis for
automatic assessment and mitigation of unwanted and malicious activities within Android applications. Malicious
and Unwanted Software (MALWARE), 2011 6th International Conference on. IEEE, Piscataway. 2011.
25

References
[9] Sharif, M. I., Lanzi, A., Giffin, J. T., & Lee, W. Impeding Malware Analysis Using Conditional Code Obfuscation. In
NDSS. February 2008.
[10] Drake, J. J., Lanier, Z., Mulliner, C., Fora, P. O., Ridley, S. A., & Wicherski, G. Android Hacker's Handbook. John
Wiley & Sons, Indianapolis. 2014.
[11] Petsas, T., Voyatzis, G., Athanasopoulos, E., Polychronakis, M., & Ioannidis, S. Rage against the virtual
machine: hindering dynamic analysis of android malware. In Proceedings of the Seventh European Workshop on
System Security (p. 5). ACM. April 2014.
[12] Supervised Learning Workflow and Algorithms, Retrieved December 8, 2017, Official Address:
https://www.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html.
[13] Supervised Learning, Retrieved: December 8, 2017, Official Address:
https://www.mathworks.com/discovery/supervised-learning.html.
[14] Supervised Learning, Retrieved: December 8, 2017, Official Address:
https://en.wikipedia.org/wiki/Supervised_learning.
26

References
[15] Yajin Zhou, Xuxian Jiang, "Dissecting Android Malware: Characterization and Evolution," Proceedings of the
33rd IEEE Symposium on Security and Privacy (Oakland 2012), San Francisco, CA, May 2012.
[16] Narudin, F. A., Feizollah, A., Anuar, N. B., & Gani, A. Evaluation of machine learning classifiers for mobile
malware detection. Soft Computing, 1-15. 2014.
[17] Krutz, D. E., Mirakhorli, M., Malachowsky, S. A., Ruiz, A., Peterson, J., Filipski, A., & Smith, J. (2015, May). A
dataset of open-source Android applications. In Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working
Conference on (pp. 522-525). IEEE.
[18] Confusion Matrix, Retrieved: December 8, 2017, Official Address:
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/ confusion_matrix.html
[19] Simple guide to confusion matrix terminology, Retrieved: December 8, 2017, Official Address:
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
27

Question?
28

Malware analysis on android using supervised machine learning techniques

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Malware analysis on android using supervised machine learning techniques

Similar to Malware analysis on android using supervised machine learning techniques (20)

More from Md. Shohel Rana

More from Md. Shohel Rana (7)

Recently uploaded

Recently uploaded (20)

Malware analysis on android using supervised machine learning techniques