Security Application for Malicious Code Detection using Data Mining
1. Security Applications for Malicious Code Detection using Data Mining
Under the guidance of
Prof. Pawade A.S.
Presented By
1. Kulkarni Eshwari
2. Annaldas Namrata
3. Yalameli Pravin
4. Shingade Karishma
5. Jena Suvendu
SAMCD
2. OVERVIEW
1. Introduction
   What is Malicious Code?
   Harmful effects of Malicious code
   How Data Mining is useful for detecting Malicious code
3. Problem Statement
4. Objective and Scope
5. Methodology
6. Vision of System Architecture
7. Algorithm
8. Flow of Project
9. Advantages
10. Conclusion
11. References
3. INTRODUCTION
What is Malicious Code?
Malicious code is any code, in any part of a software system or script,
that is intended to cause undesired effects, security breaches or damage
to a system.
Example: a rather simple virus that searches for "*.exe" files.
Malicious code exploits a software vulnerability on a victim machine.
4. Harmful effects of Malicious code
Malicious code can harm the confidentiality, integrity or availability of
your computer data or network, and can cause further harm by stealing
your personal information.
It may also remotely infect other victims.
5. How Data Mining is useful for detecting Malicious code
The goal is to automatically design and build a scanner that accurately
detects malicious executables before they have been given a chance to run.
Data mining methods detect patterns in large amounts of data, such as
byte code, and use these patterns to detect future instances in similar data.
The framework uses classifiers to detect new malicious executables.
A classifier is a rule set, or detection model, generated by the data
mining algorithm that was trained over a given set of training data.
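As a rough illustration of "patterns in byte code", the sketch below counts byte n-grams and turns each sample into a count vector that a classifier could be trained on. It is a minimal sketch and not the project's actual feature extractor: the function names (`byte_ngrams`, `top_ngrams`, `to_vector`) and the toy byte strings are assumptions made for the example.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def top_ngrams(samples, n=2, k=3):
    """Return the k most frequent n-grams over all samples.

    Ties are broken by byte order so the result is deterministic.
    """
    total = Counter()
    for sample in samples:
        total.update(byte_ngrams(sample, n))
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def to_vector(data: bytes, vocabulary, n=2):
    """Describe one sample as the count of each vocabulary n-gram."""
    counts = byte_ngrams(data, n)
    return [counts.get(gram, 0) for gram in vocabulary]


# Toy byte strings for illustration only, not real executables.
samples = [b"\x4d\x5a\x90\x00\x4d\x5a", b"\x90\x00\x90\x00"]
vocab = top_ngrams(samples, n=2, k=2)
vectors = [to_vector(s, vocab) for s in samples]
```

In practice the vocabulary would be built from the training set only, so that the testing set stays unseen.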
6. PROBLEM STATEMENT
Traditional signature-based malware detection is ineffective, because
malware constantly changes its nature and form through obfuscation
techniques. Some feature representations are effective for detecting
malicious code in large volumes of historical data, using classifiers and
learning algorithms such as RIPPER to achieve a higher detection rate.
7. OBJECTIVE
To perform data pre-processing that prepares an appropriate input format
for machine learning classifiers.
To develop two representative supervised machine learning models, such
as a decision tree.
To evaluate the performance of a support vector machine and an artificial
neural network in classifying new malicious executable programs.
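The pre-processing objective can be pictured as turning each sample's extracted features into a fixed-length vector. The sketch below is one possible encoding, assuming each sample is described by the set of DLL names it imports; the DLL names, the labels and the helper names (`build_vocabulary`, `encode`) are hypothetical.

```python
def build_vocabulary(feature_sets):
    """Collect every feature name seen in any sample, in sorted order."""
    return sorted(set().union(*feature_sets))


def encode(features, vocabulary):
    """Binary vector: 1 if the sample has the feature, else 0."""
    return [1 if name in features else 0 for name in vocabulary]


# Hypothetical samples: (imported DLL names, label).
samples = [
    ({"kernel32.dll", "user32.dll"}, "clean"),
    ({"kernel32.dll", "wsock32.dll"}, "malicious"),
]

vocabulary = build_vocabulary([features for features, _ in samples])
X = [encode(features, vocabulary) for features, _ in samples]
y = [1 if label == "malicious" else 0 for _, label in samples]
```

Every row of `X` has the same length, which is the format most supervised learners expect.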
8. SCOPE
The project focuses on malicious programs that exist on Microsoft Windows,
which is used as the experiment platform, with VMware as the virtual machine.
The project focuses on supervised machine learning techniques, because
they allow statistical comparisons on specific datasets to examine the
accuracy of trained classifiers.
9. METHODOLOGY
DECISION TREE AND RULES:
They work over a single table, and over a single attribute at a time.
They are useful when the outcomes are uncertain.
They allow different possible decisions to be compared.
They are easy to understand: they build a model made up of rules
(split points).
They are one of the most widely used data mining techniques.
10. Classification Example
Suppose there are two predictor attributes: Age and Car-type
(Sport, Minivan and Truck).
Age is an ordered attribute; Car-type is a categorical attribute.
The class label indicates whether the person bought the product.
The dependent attribute is categorical.

Age   Car   Class
20    M     Yes
30    M     Yes
25    T     No
30    S     Yes
40    S     Yes
20    T     No
30    M     Yes
25    M     Yes
40    M     Yes
20    S     No

(Car: S = Sport, M = Minivan, T = Truck)
12. A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with similar
values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample.
If the sample is completely homogeneous, the entropy is zero; if the
sample is equally divided between two classes, the entropy is one.
Entropy
Entropy(S) = - sum over classes i of p_i * log2(p_i), where p_i is the
proportion of instances in S that belong to class i.
Information Gain
The information gain is based on the decrease in entropy after a
dataset is split on an attribute. Constructing a decision tree is all about
finding the attribute that returns the highest information gain (i.e., the
most homogeneous branches).
Step 1 : Calculate entropy of the target.
Step 2 : The dataset is then split on the different attributes. The
entropy for each branch is calculated.
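The two steps above can be sketched on the ten rows of the classification example (Age, Car, Class). This is a minimal sketch, not a full ID3 implementation: for simplicity it treats every distinct Age value as its own branch instead of searching for a split point on the ordered attribute, and the function names are chosen for the example.

```python
from collections import Counter
from math import log2

# (age, car, class) rows copied from the classification example.
ROWS = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index, label_index=2):
    """Entropy of the target minus the weighted entropy of each branch."""
    labels = [row[label_index] for row in rows]
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[label_index])
    remainder = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy(labels) - remainder


# Step 1: entropy of the target (7 "Yes", 3 "No").
target_entropy = entropy([row[2] for row in ROWS])

# Step 2: split on each attribute and compare the gains.
gain_age = information_gain(ROWS, 0)
gain_car = information_gain(ROWS, 1)
```

On this data the target entropy is about 0.88, the gain for Car is about 0.61 and the gain for Age is about 0.41, so Car would be chosen for the root split under these assumptions.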
13. Advantages of Decision Tree
Use of probability allows flexibility
Objective analysis for decision making
Encourages clear thinking and planning
14. VISION OF SYSTEM ARCHITECTURE
The architecture of our malware detection system consists of three main
modules:
1. PE-Miner
2. Feature selection and data transformation
3. Learning algorithms such as RIPPER
15. [Figure: system architecture. PE-Miner extracts the PE header, DLLs and DLL function calls into a feature database; feature selection and transformation produce the training set and testing set; the learning algorithms produce the classification result.]
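To make the PE header stage of the architecture more concrete, the sketch below reads the fixed-size COFF file header of a PE file using only the standard library. It is an assumption-laden illustration, not the PE-Miner module itself: the function names are hypothetical, and the input is a synthetic buffer built in the example, not a real executable.

```python
import struct

COFF_FIELDS = (
    "machine", "number_of_sections", "time_date_stamp",
    "pointer_to_symbol_table", "number_of_symbols",
    "size_of_optional_header", "characteristics",
)


def parse_coff_header(data: bytes) -> dict:
    """Return the COFF file header fields of a PE image as a dict."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def make_fake_pe(sections: int = 3) -> bytes:
    """Build a synthetic header-only buffer for demonstration."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = struct.pack("<HHIIIHH", 0x014C, sections, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff


header = parse_coff_header(make_fake_pe())
```

Fields such as `number_of_sections` and `characteristics` are the kind of header values that could be stored in the feature database, alongside the DLL and function-call features.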
16. ALGORITHM
We propose a data mining algorithm to produce new classifiers with
separate features.
RIPPER algorithm
The RIPPER algorithm is an inductive rule learner.
It was developed to detect examples of malicious executables.
The algorithm uses libBFD data as features.
It builds a set of rules that is able to determine the classes while
reducing ambiguities.
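The output of a rule learner such as RIPPER can be pictured as an ordered list of rules with a default class. The sketch below only shows how such a rule set might be applied to a new file; it does not implement RIPPER's rule growing and pruning, and the rules, feature names and `classify` helper are hypothetical.

```python
def classify(features, rules, default="clean"):
    """Return the label of the first rule whose conditions all hold.

    Each rule is (required_features, label); a rule fires when every
    required feature is present in the sample's feature set.
    """
    for required, label in rules:
        if required <= features:
            return label
    return default


# Hypothetical rules for illustration, not rules learned from real data.
RULES = [
    (frozenset({"wsock32.dll", "advapi32.dll"}), "malicious"),
    (frozenset({"CreateRemoteThread"}), "malicious"),
]

verdict_a = classify({"kernel32.dll", "wsock32.dll", "advapi32.dll"}, RULES)
verdict_b = classify({"kernel32.dll", "user32.dll"}, RULES)
```

Because the rules are checked in order and fall back to a default class, every file receives exactly one label, which is one way of reducing ambiguity between classes.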
17. FLOW OF PROJECT
[Flowchart: Start; prepare the dataset of .exe files; read the file attributes from a particular file; separate the call header and call code of the file; prepare the training set and prepare for testing; decision "Is the information gain count of the file = 0?"; decision "Is the accuracy of the prediction correct?" (No: change the prediction attribute); outcomes "File is dirty / malicious" and "File is clean / non-malicious"; Stop.]
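One way to read the "change the prediction attribute" loop in the flow is sketched below: train a one-attribute rule on the training set, check its accuracy on the testing set, and move on to the next attribute if the accuracy is too low. This is an interpretation under stated assumptions, not the project's code: the attribute names, the toy rows, the 0.9 threshold and the helper names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_one_rule(rows, attr):
    """Map each value of attr to the majority label in the training rows."""
    votes = defaultdict(Counter)
    for features, label in rows:
        votes[features[attr]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in votes.items()}


def accuracy(rule, rows, attr, default):
    """Fraction of rows whose label the rule predicts correctly."""
    hits = sum(rule.get(f[attr], default) == label for f, label in rows)
    return hits / len(rows)


def pick_attribute(train, test, attrs, threshold=0.9, default="clean"):
    """Try attributes in order until one is accurate enough on the test set."""
    for attr in attrs:
        rule = train_one_rule(train, attr)
        score = accuracy(rule, test, attr, default)
        if score >= threshold:
            return attr, rule, score
    return None


# Toy rows for illustration: (file attributes, label).
train = [
    ({"has_gui": True, "calls_network": True}, "malicious"),
    ({"has_gui": True, "calls_network": False}, "clean"),
    ({"has_gui": False, "calls_network": True}, "malicious"),
    ({"has_gui": False, "calls_network": False}, "clean"),
]
test = [
    ({"has_gui": True, "calls_network": True}, "malicious"),
    ({"has_gui": False, "calls_network": False}, "clean"),
]

result = pick_attribute(train, test, ["has_gui", "calls_network"])
```

Here the first attribute scores only 0.5 on the testing set, so the loop moves on and settles on the second attribute.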
18. ADVANTAGES
Fast testing
Low overhead
Robust against many obfuscation (confusion) techniques
19. CONCLUSION
There is a need for a technique in which detection of malicious patterns in
executable code sequences can be done more efficiently.
It is expected that this procedure will lead to the development of better
algorithms for identifying the malicious code that has infected a system.