Malware Detection using n-grams
and Evaluation using Machine
Learning Algorithms
11MSE0195-SHERIN JOSEPHIN B
Abstract
• Computer security is a major concern today. The term "malware"
denotes malicious software that compromises computer security.
• Most anti-virus software fails to detect new viruses. Using
n-grams as file signatures can help detect unknown malware and
reduce the false positive ratio.
• The dataset is further optimized using a feature selection
algorithm. The final Feature Vector Table obtained from feature
selection and dimension reduction is compared and evaluated
using various machine learning algorithms.
Aim and Scope
• The aim of this project is to detect malware files using n-gram
analysis and to evaluate the detection using machine learning
algorithms.
• Because many antivirus products fail to detect new viruses,
n-grams are used as the model to detect malware files efficiently.
• This project focuses on developing a better tool to detect
malware files, taking space complexity into consideration.
• The technique is currently used in industry, where securing data
is a main focus. Anti-virus software such as Kaspersky and K7
uses this technique to detect malware files.
LITERATURE SURVEY
S.N, TITLE, ABSTRACT, TECHNIQUES, ADVANTAGES

8. “Static Malware Detection with Segmented Sandboxing”
   Abstract: This study takes the best parts of the static and dynamic
   detection approaches and applies the combination, called “Segmented
   Sandboxing”, to detect malware files.
   Techniques: 1. Segmented sandboxing
   Advantages: Higher detection rate (compared with previous data)

9. “N grams based file signature for malware detection”
   Abstract: This study proposes the use of n-grams as file signatures
   in order to detect unknown malware.
   Techniques: 1. n-grams
   Advantages: Low false positive ratio

10. “A Hybrid Model to Detect Malicious Executables”
   Abstract: This paper proposes a feature set, called the hybrid
   feature set, which is given to a support vector machine that
   classifies files as malware or benign.
   Techniques: 1. n-grams 2. SVM
   Advantages: 1. High accuracy 2. Low false positive rate

11. “Detection of New Malicious Code Using N-grams Signatures”
   Abstract: This paper describes n-gram analysis that classifies
   files as malware or benign.
   Techniques: 1. n-grams
   Advantages: 1. Efficient 2. Scalable 3. Practical solutions
Architecture
Detailed Design
Module Description
MODULE 1: Dataset preparation
- Executable files (benign or malware) are disassembled using a
disassembler.
- The assembly code is parsed, and the opcode sequence is collected in the dataset.
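The parsing step in Module 1 could look roughly like the sketch below. The slides do not name the disassembler or its output format, so the tab-separated `objdump -d` style listing (address, raw bytes, instruction) and the helper name `extract_opcodes` are assumptions made only for illustration.

```python
from typing import List


def extract_opcodes(listing: str) -> List[str]:
    """Collect the opcode (mnemonic) sequence from a disassembly listing.

    Assumes objdump-style lines with three tab-separated fields:
    address, raw bytes, and the instruction text. This format is an
    assumption; the slides do not say which disassembler was used.
    """
    opcodes = []
    for line in listing.splitlines():
        fields = line.split("\t")
        # Skip section headers, blank lines, and byte-only continuation lines.
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        tokens = fields[2].split()
        if tokens:
            # The first token of the instruction text is the mnemonic.
            opcodes.append(tokens[0].lower())
    return opcodes


# Hypothetical listing, hand-written for this example.
SAMPLE_LISTING = "\n".join([
    "0000000000401000 <_start>:",
    "  401000:\t55                   \tpush   %rbp",
    "  401001:\t48 89 e5             \tmov    %rsp,%rbp",
    "  401004:\t31 c0                \txor    %eax,%eax",
    "  401006:\t5d                   \tpop    %rbp",
    "  401007:\tc3                   \tret",
])

print(extract_opcodes(SAMPLE_LISTING))  # ['push', 'mov', 'xor', 'pop', 'ret']
```

Operands are discarded on purpose, since the dataset described here holds only the opcode sequence.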
MODULE 2: Create Feature Vector Table (FVT) by n-gram extraction
- The dataset is split into training data and testing data.
- The training data is used for n-gram extraction.
- The extracted n-grams are stored in a table called the Feature Vector Table
(FVT).
- The Feature Vector Table consists of the opcode, its frequency count, and its
respective class.
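A minimal sketch of the n-gram extraction in Module 2, assuming each training sample is an opcode list paired with a class label. The function names `opcode_ngrams` and `build_fvt`, the sample data, and the choice to count frequencies per class are illustrative assumptions; the slides do not specify how the counts are aggregated.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every run of n consecutive opcodes as a tuple."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def build_fvt(samples, n):
    """Build a Feature Vector Table from (opcodes, label) training samples.

    Each row is (n-gram, frequency count, class). Counting per class is
    an assumption made for this sketch.
    """
    counts = Counter()
    for opcodes, label in samples:
        for gram in opcode_ngrams(opcodes, n):
            counts[(gram, label)] += 1
    return [(" ".join(gram), count, label)
            for (gram, label), count in sorted(counts.items())]


# Hypothetical training data, not taken from the project's dataset.
training = [
    (["push", "mov", "xor", "pop", "ret"], "benign"),
    (["xor", "xor", "jmp", "xor", "xor"], "malware"),
]

for row in build_fvt(training, 2):
    print(row)
```

A sequence shorter than n simply yields no n-grams, so very small files contribute no rows.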
MODULE 3: Employing Feature Reduction Algorithm
- Principal Component Analysis (PCA)
MODULE 4: Classification using Machine Learning Algorithms
- J48, Support Vector Machine (SVM), and Random Forest
UML Design
•USE CASE DIAGRAM
•CLASS DIAGRAM
•SEQUENCE DIAGRAM
•ACTIVITY DIAGRAM
•STATE CHART DIAGRAM
USE CASE DIAGRAM
CLASS DIAGRAM
Sequence Diagram
Activity Diagram
State Transition
Results and Discussion
           With PCA    Without PCA
2-grams    8           216
3-grams    9           256
4-grams    8           256
With Feature Selection Algorithm
Performance Table for 2-grams
2-grams          Random Forest    SVM       J48
Classified       95%              82.50%    88%
Misclassified    12.30%           82.50%    36.40%
Precision        95.00%           68.10%    86.90%
Graphic view for 2-grams (bar chart of TPR, FPR, and Precision for Random Forest, SVM, and J48; axis from 0% to 100%)
Performance Table for 3-grams
3-grams          Random Forest    SVM       J48
Classified       92%              94.70%    84%
Misclassified    52.10%           34.70%    53.20%
Precision        92.80%           95.00%    84.20%
Graphic view for 3-grams (bar chart of TPR, FPR, and Precision for Random Forest, SVM, and J48; axis from 0% to 100%)
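The charts are labelled with TPR, FPR, and Precision. As a reminder of how these three measures follow from a confusion matrix, here is a small sketch that assumes malware is treated as the positive class. The counts in the example are invented for illustration and are not the project's results.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, and precision from confusion-matrix counts.

    tp: malware files flagged as malware
    fp: benign files flagged as malware
    tn: benign files passed as benign
    fn: malware files passed as benign
    """
    return {
        "TPR": safe_ratio(tp, tp + fn),        # share of malware detected
        "FPR": safe_ratio(fp, fp + tn),        # share of benign files flagged
        "Precision": safe_ratio(tp, tp + fp),  # share of flags that are correct
    }


# Hypothetical counts chosen only to show the arithmetic.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

A low FPR matters as much as a high TPR here, since the stated goal of the n-gram signatures is to reduce the false positive ratio.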
Normal Code and Obfuscated Code
Disassembling the executables
Parser
N-grams extraction
Opcode and its frequency and class
Data set
Before PCA
After Feature Selection- PCA
Classification