In recent years, a widespread research is conducted with the growth of malware resulted in the domain of malware analysis and detection in Android devices. Android, a mobile-based operating system currently having more than one billion active users with a high market impact that have inspired the expansion of malware by cyber criminals. Android implements a different architecture and security controls to solve the problems caused by malware, such as unique user ID (UID) for each application, system permissions, and its distribution platform Google Play. There are numerous ways to violate that fortification, and how the complexity of creating a new solution is enlarged while cybercriminals progress their skills to develop malware. A community including developer and researcher has been evolving substitutes aimed at refining the level of safety where numerous machine learning algorithms already been proposed or applied to classify or cluster malware including analysis techniques, frameworks, sandboxes, and systems security. One of the most promising techniques is the implementation of artificial intelligence solutions for malware analysis. In this paper, we evaluate numerous supervised machine learning algorithms by implementing a static analysis framework to make predictions for detecting malware on Android.
APM Welcome, APM North West Network Conference, Synergies Across Sectors
Malware analysis on android using supervised machine learning techniques
1. 2018 International Conference on Intelligent Information Technology (ICIIT 2018)
1
Date : February, 2018
Location : Vietnam
Presented by:
Md. Shohel Rana
Instructed by:
Andrew H. Sung
MALWARE ANALYSIS ON ANDROID USING SUPERVISED MACHINE
LEARNING TECHNIQUES
2. Contents
Motivation and Contribution
Android with Security Mechanism
Malware
Techniques for threat analysis
Static analysis, Dynamic analysis, Machine Learning
Supervised Machine Learning
Machine Learning – Malware analysis
Work
Methodology, Experiment, Results, Discussion
Conclusion and future work
References
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
2
3. Motivation and Contribution
Android, an OS has over one billion active users for all their mobile
devices, with a market impact that is influencing an increase in the
amount of information obtained from different users, facts that have
motivated the development of malware by cybercriminals
To solve the problems caused by malware, Android implements a
different architecture and security controls, such as unique user ID
(UID) for each application, system permissions, and its distribution
platform Google Play
There are many ways to violate that protection, and how the
complexity for creating a new solution are increased while
cybercriminals improve their skills to develop malware
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
3
4. Motivation and Contribution (Cont.)
The community of developers and researchers has been developing
alternatives aimed at improving the level of safety where some
solutions have been proposed: analysis techniques, frameworks,
sandboxes, and systems security
One of the most promising way is the implementation of artificial
intelligence solutions for malware analysis
This work proposed a module that implements a static analysis
framework with six machine learning algorithms for detecting
malware on Android
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
4
5. Android with Security Mechanism
Open source based on the Linux kernel based operating system for mobile
devices having more than one billion active users [1][2]
Applications are compiled into an Android Application Package (APK) file
Source code (.dex files)
AndroidManifest.xml
Provides information on the characteristics and security settings of each application
[4].
API permissions
Activities
Services
Content providers
Broadcast receivers
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
5
6. Android with Security Mechanism
(cont.) Security mechanisms
Kernel-level sandbox environment to prevent file-system access
Application level permissions API [3]
Security tools at the application development level
Google Play digital distribution platform
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
6
7. Malware
Malicious software, is development that is projected to harm users who use
it
Trojan horse: A Trojan horse or Trojan is a type of malware that is often disguised as legitimate software
Spyware: Designed to gather data from a computer or other device and forward it to a third party without the
consent or knowledge of the user
Worms: A computer worm is a standalone malware computer program that replicates itself in order to spread to
other computers. Often, it uses a computer network to spread itself, relying on security failures on the target
computer to access it
Root-kit: A rootkit is a program designed to provide hackers with administrative access to your computer
without you knowing. It can be used to remotely control a device. A rootkit hides its presence on a PC, usually
within some of the lower layers of the operating system
Ransomware: Ransomware is a subset of malware in which the data on a victim's computer is locked, typically
by encryption, and payment is demanded before the ransomed data is decrypted and access returned to the
victim
Modify the truthful action of the device
Gain sensitive information from users
First Trojan found in Android in 2010 [5]
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
7
8. Malware (cont.)
The bellow figure shows the number of attacks blocked by Kaspersky
lab solutions in 2015 [6]
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
8
9. Techniques for Threat Analysis
According to project ‘SafeCandy’ [7]
Static analysis
Dynamic analysis
Artificial intelligence - machine learning
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
9
10. Techniques for Threat Analysis (cont.)
Static analysis:
It is a technique that evaluates malicious behaviors in source code, data or
binary files without running the application directly [8]
Its complexity has increased due to the experience acquired by cybercriminals
in the development of applications and it has been shown that it is possible to
avoid it from obfuscation techniques [9]
Dynamic analysis:
They are methods that study the behavior of malware in execution through
gesture simulation; in this technique we analyze the running processes, the user
interface, network connections, sockets opening, among others [10]
On the other hand, already exist techniques that avoid the process realized by
the dynamic analysis, where the malware has the ability to detect environments
sandbox and to stop its malicious behavior [11]
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
10
11. Techniques for Threat Analysis (cont.)
Artificial intelligence – machine learning
Supervised learning
Unsupervised learning
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
11
12. Supervised Machine Learning
Builds a model to make predictions using a known set of input
data and responses by training a model to make predictions for
the response to new data
Supervised machine learning splits into two broad categories [12]-
[14]:
Classification (When the algorithm uses the data to predict a category),
Regression (When a value is used to predict a continuous measurement for
an observation)
Common classification algorithms
Support vector machines (SVM), Neural networks, Naïve Bayes classifier,
Decision trees, Discriminant analysis, K-Nearest neighbors (k-NN)
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
12
13. Supervised Machine Learning (Cont.)
Common regression algorithms:
Linear regression, Nonlinear regression, Generalized linear models, Decision
trees, Neural networks
Steps in Supervised Learning
1. Determine the type of training
2. Gather a training set
3. (i) Determine the input feature representation of the learned function and
structure
(ii) Determine the structure of the learned function and corresponding
learning algorithm
4. Complete the design
5. Evaluate the accuracy of the learned function
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
13
14. Machine Learning - Malware
Both static and dynamic analysis assist to obtain the data which
will then be processed by the algorithms of Machine Learning
The datasets must be balanced
Most of the studies use the malware dataset of the ‘MalGenome’
project [13] and a crawler [14] for the download of benign
applications
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
14
15. Methodology
In our methodology section we build a model using a study of static
analysis of the application configuration permissions containing train
and test datasets, assessment standards, the outcomes of the
experimentation of these algorithms
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
15
Table1. Dataset Description
Android Malware Training Set Test Set
Value Count Percent Count Percent Count Percent
0 199 50.00% 116 48.54% 83 52.20%
1 199 50.00% 123 51.46% 76 47.80%
16. Methodology (cont.)
The proposed framework is divided into five steps:
1.Gather a training set: A total of 558 APKs were collected where 279 low-privilege
applications of the open source dataset and a random selection of 279
‘MalGenome’ malware.
2.Extraction: To attain the AndroidManifest.xml file from the collected APK we used
the ‘ApkTool 2.3.0’ tool (Released-21 Sep 2017)
3.Features vector: We used MATLAB (2017a) for the development purpose with
creating the binary vectors by accessing of every application’s permission, and a
label of application classification was added to determine whether the
application as benign or malicious. Each application is allotted a vector defined
by V = (R1, R2, ..., R330, C) where,
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
16
�� = {0,�� 𝑎�𝑜� ℎ𝑒� 𝑐𝑎𝑠𝑒
1,if the application is malicious
�� = {0,�� 𝑎�𝑜� ℎ𝑒� 𝑐𝑎𝑠𝑒
1,�� 𝑑𝑒�𝑒𝑐�𝑒𝑑 𝑎� 𝑎𝑐𝑐𝑒𝑠𝑠 𝑝𝑒�𝑚�𝑠𝑠�𝑜�
17. Methodology (cont.)
4. Training and testing: For our experiment, we divided the dataset into two
sections where 60% of the data used for training and 40% of the data, selected
randomly, for testing purpose.
5. Classification: After training and testing using these supervised learning
algorithms we reached at position with classifying the applications as benign or
malicious shown in below figure.
Fig. 1. A static analysis framework to detect malware on Android
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
17
18. Measurement Metrics
A confusion matrix is a matrix which contains information about actual and
predicted classifications in Table 2 done by a classification to measure the
algorithm performance using the matrix data [18] - [19]
Accuracy (AC) is the proportion of the total number of corrected
predictions. Overall, how often is the classifier correct
Misclassification Rate (1-Accuracy) describe the overall, how often is it
wrong
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
18
𝐴𝑐𝑐𝑢�𝑎𝑐� =
𝑁𝑜. 𝑜� 𝑐𝑜��𝑒𝑐�𝑙� 𝑐𝑙𝑎𝑠𝑠���𝑒𝑑 𝑑𝑎�𝑎 (𝑇� + 𝑇𝑁)
𝑇𝑜�𝑎𝑙
𝑚�𝑠𝑐𝑙𝑎𝑠𝑠���𝑐𝑎��𝑜� =
𝑁𝑜. 𝑜� 𝑚�𝑠𝑐𝑙𝑎𝑠𝑠���𝑒𝑑 𝑑𝑎�𝑎 (𝐹� + 𝐹𝑁)
𝑇𝑜�𝑎𝑙
Table2. Confusion Matrix
Actual class
Positive Negative
Predicted Class
Positive True Positive (TP) False Positive (FP)
Negative False Negative (FN) True Negative (TN)
19. Measurement Metrics (CONT.)
Precision (P) is the proportion of the correctly predicted positive
cases determined by
The recall or true positive rate (TP) is the proportion of the correctly
identified positive cases defined by
f1- Score is a weighted average of the True Positive (TP) rate (recall)
and Precision (P)
ROC Curve is a graph to summarize the performance of the classifier
over all probable thresholds generated by plotting the True Positive
(TP) Rate in Y-axis against the False Positive (FP) Rate in X-axis
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
19
�𝑒𝑐𝑎𝑙𝑙 =
𝑇 �
𝑇 � + 𝐹 𝑁
𝑝�𝑒𝑐�𝑠�𝑜� =
𝑇 �
𝑇 � + 𝐹 �
20. Experiment and result
Technologies used
The experiments are done using MATLAB (2017a) on the machine where the
System Model was MacBookPr011.5, Processor was Intel(R) Core(TM) i7-4870HQ
CPU @ 2.50GHz (8 CPUs), ~2.5GHz 64-bit PC.
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
20
22. Experiment and result (Cont.)
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
22
Fig. 2. ROC Curve Fig. 3. Misclassification rate
23. Result and Discussion
From the results we get, Naive Bayes and Discriminant analysis give the worst
results both in the classification for benign and malicious behavior and the
performances of them are 84% and 86% respectively. The neural network
algorithm gives the average result with 89% performance. Decision Tree and
SVM algorithms provide almost similar results with 94% and 93% performances
respectively. Finally, the algorithm that presented the best results for our
dataset was K-Nearest Neighbors both in the classification (96% for benign
software classification and 96% for malware) and overall performance is 96%.
The above results can be seen through the ROC curve (Fig. 2) and the
misclassification curve (Fig. 3).
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
23
24. Conclusion and future work
In this article we propose the assessment numerous supervised
machine learning algorithm to detect malware on Android from their
access permissions. As future work suggestions, we will try to do the
following:
Assess trained classifiers in contradiction of another set of applications with over
privileges.
Study and Train other artificial intelligence techniques.
Implement a system that will allow the algorithm to learn from wicked
classifications.
Discover other features of Android for the training and testing
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
24
25. References
[1] Pichai, S. Google I/O 2014 - Keynote [video. 6:43m], https://www.google.com/events/io. June 2014.
[2] Elenkov, N. Android Security Internals: An In-depth Guide to Android's Security Architecture. No Starch Press.
October 2014.
[3] Drake, J. J., Lanier, Z., Mulliner, C., Fora, P. O., Ridley, S. A., & Wicherski, G. Android Hacker's Handbook. John
Wiley & Sons, Indianapolis. March 2014.
[4] Peiravian, N., & Zhu, X. Machine learning for android malware detection using permission and api calls. In Tools
with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on (pp. 300-305). IEEE. November 2013.
[5] Plaza, E. Kaspersky detecta el primer troyano para Android - Wayerless.
http://www.wayerless.com/2010/08/kaspersky-detecta-el-primer-troyano-para-android. Agust 2010.
[6] Unuchek, R., & Chebyshev, V. Mobile malware evolution 2015. https://securelist.com/analysis/kaspersky-
security-bulletin/73839/mobile-malware-evolution-2015/. Feb 2016.
[7] Londoño, S., Urcuqui, C. C., Amaya, M. F., Gómez, J., & Cadavid, A. N. (2015). SafeCandy: System for security,
analysis and validation in Android.Sistemas y Telemática, 13(35), 89-102.
[8] Batyuk, L., Herpich, M., Camtepe, S. A., Raddatz, K., Schmidt, A., & Albayrak, S. Using static analysis for
automatic assessment and mitigation of unwanted and malicious activities within Android applications. Malicious
and Unwanted Software (MALWARE), 2011 6th International Conference on. IEEE, Piscataway. 2011.
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
25
26. References
[9] Sharif, M. I., Lanzi, A., Giffin, J. T., & Lee, W. Impeding Malware Analysis Using Conditional Code Obfuscation. In
NDSS. February 2008.
[10] Drake, J. J., Lanier, Z., Mulliner, C., Fora, P. O., Ridley, S. A., & Wicherski, G. Android Hacker's Handbook. John
Wiley & Sons, Indianapolis. 2014.
[11] Petsas, T., Voyatzis, G., Athanasopoulos, E., Polychronakis, M., & Ioannidis, S. Rage against the virtual
machine: hindering dynamic analysis of android malware. In Proceedings of the Seventh European Workshop on
System Security (p. 5). ACM. April 2014.
[12] Supervised Learning Workflow and Algorithms, Retrieved December 8, 2017, Official Address:
https://www.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html.
[13] Supervised Learning, Retrieved: December 8, 2017, Official Address:
https://www.mathworks.com/discovery/supervised-learning.html.
[14] Supervised Learning, Retrieved: December 8, 2017, Official Address:
https://en.wikipedia.org/wiki/Supervised_learning.
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
26
27. References
[15] Yajin Zhou, Xuxian Jiang, "Dissecting Android Malware: Characterization and Evolution," Proceedings of the
33rd IEEE Symposium on Security and Privacy (Oakland 2012), San Francisco, CA, May 2012.
[16] Narudin, F. A., Feizollah, A., Anuar, N. B., & Gani, A. Evaluation of machine learning classifiers for mobile
malware detection. Soft Computing, 1-15. 2014.
[17] Krutz, D. E., Mirakhorli, M., Malachowsky, S. A., Ruiz, A., Peterson, J., Filipski, A., & Smith, J. (2015, May). A
dataset of open-source Android applications. In Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working
Conference on (pp. 522-525). IEEE.
[18] Confusion Matrix, Retrieved: December 8, 2017, Official Address:
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/ confusion_matrix.html
[19] Simple guide to confusion matrix terminology, Retrieved: December 8, 2017, Official Address:
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
2018 International Conference on Intelligent Information Technology (ICIIT 2018)
27