Malware Detection using Machine Learning & Deep Learning
Minh Đức + Đình Phúc
CyRadar Team at SBC 2019
Disclaimer: This topic is about Machine Learning & Deep Learning
Contents
1. Reality
2. In Research
3. In CyRadar
4. Demo
5. Conclusion
6. Q&A
1. Reality
Malware
Over 350,000 new malware samples per day.
- Is a very big threat in today’s computing world.
- Continues to grow in volume and evolve in complexity.
- There are many malware generators.
- The number of websites distributing malware is increasing at an alarming rate and is getting out of control.
Malware detection
- Signature-based: code, hash, behavior, rules, ...
Advantages:
• High accuracy.
Disadvantages:
• Unable to detect new malware.
• Easy to bypass.
• Requires frequent database updates.
• Relies on human expertise to create the signatures.
*A Theoretical Feature-wise Study of Malware Detection Techniques
2. In Research
1. Malware Detection using Machine Learning and Deep Learning | Hemant Rathore, Swati Agarwal, Sanjay K. Sahay, Mohit Sewak | BITS Pilani, Dept. of CS & IS, Goa Campus, Goa, India | 4 Apr 2019
2. Malware Detection using Windows API Sequence and Machine Learning | Chandrasekar Ravi, R Manoharan | Department of Computer Science and Engineering, Pondicherry Engineering College, Pillaichavady, Puducherry - 605014, India | April 2012
3. DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification | Eli (Omid) David | Dept. of Computer Science, Bar-Ilan University | 23 Nov 2017
4. DeepAM: a heterogeneous deep learning framework for intelligent malware detection | Yanfang Ye, Lingwei Chen, Shifu Hou, William Hardy, Xin Li | 12 May 2016
5. Behavior-based features model for malware detection | Hisham Shehata Galal, Yousef Bassyouni Mahdy, Mohammed Ali Atiea | 12 December 2014
6. A Fast Malware Detection Algorithm Based on Objective-Oriented Association Mining | Yuxin Ding, Xuebing Yuan, Ke Tang, Xiao Xiao, Yibin Zhang | 19 January 2013
Machine learning principle
• Training phase: benign/malware samples → extract features → training → predictive model.
• Detection phase: unknown sample → extract features → predictive model → model decision.
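The two phases can be sketched in a few lines. This is only an illustration: the nearest-centroid "model", the two-number feature vectors, and the function names `train` and `detect` are assumptions made for the example, not the classifiers discussed in the talk.

```python
# Toy stand-in for the training and detection phases.
# The nearest-centroid model and the feature values are hypothetical.
from statistics import mean


def train(samples):
    """Training phase: labelled feature vectors -> predictive model."""
    model = {}
    for label in {lbl for _, lbl in samples}:
        rows = [vec for vec, lbl in samples if lbl == label]
        # One centroid (column-wise mean) per class.
        model[label] = [mean(col) for col in zip(*rows)]
    return model


def detect(model, features):
    """Detection phase: features of an unknown file -> model decision."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


training_set = [
    ([0.10, 0.80], "benign"),
    ([0.20, 0.70], "benign"),
    ([0.90, 0.10], "malware"),
    ([0.80, 0.30], "malware"),
]
model = train(training_set)
decision = detect(model, [0.85, 0.20])
print(decision)
```

Any real classifier plugs into the same shape: fit on labelled features once, then reuse the fitted model for each unknown file.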
Dataset:
• VirusTotal.
• Windows API library.
• VxHeavens website.
• Malicia project.
• ...
Small, outdated data.
Malware detection
Static analysis features:
• Raw bytes
• Strings
• Header
• Metadata
• Entropy
• Opcodes
• ...
Dynamic analysis features:
• API calls
• Resource usage
• Ports
• Host
• Arguments
• ...
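Two of the static features listed above, entropy and strings, are simple enough to compute directly from a file's bytes. The sketch below is a minimal illustration; the function names, the 4-character minimum string length, and the sample bytes are assumptions for the example.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Hypothetical sample bytes, not a real executable.
sample = b"MZ\x90\x00" + b"This program cannot be run in DOS mode" + b"\x00\x01\x02"
print(byte_entropy(sample))
print(printable_strings(sample))
```

High entropy is often used as a hint that a section is packed or encrypted, which is one reason it appears among static features.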
Malware detection
Static analysis advantages:
• Allows malicious files to be detected prior to execution.
• Easy to run.
• Fast identification.
Dynamic analysis advantages:
• Detects previously unseen types of malware attacks.
• Detects polymorphic malware.
Malware detection
Static analysis disadvantages:
• Fails to detect polymorphic malware.
• Needs a separate model per sub-type.
• Can be misled by encryption, fileless malware, ...
Dynamic analysis disadvantages:
• Hard to extract features.
• Storage complexity for behavioral patterns.
• Time complexity.
Algorithms:
• Supervised learning:
• Decision tree.
• Random forest.
• Logistic Regression.
• SVM.
• Deep Learning
• ...
• Unsupervised learning:
• KNN (k-nearest neighbors is normally classed as a supervised method)
• Many algorithms achieve good results (> 90%).
• Random forest has the best results.
The AV Industry
1. CrowdStrike
2. Cylance
3. Endgame
4. MAX
5. Trapmine
6. SentinelOne
7. Sophos ML
3. In CyRadar
Opcode frequency models
• Binary classification problem
• Static analysis
PE32 files → opcode frequencies:
push    xor     ...   call    jm
0.125   0.23    ...   0.345   0.098
0.071   0.123   ...   0.32    0.22
Opcode
An opcode is the portion of a machine language instruction that specifies what operation is to be performed by the central processing unit (CPU).
Step 1: Collect data:
• Benign: download pages, Windows system files.
• Malware: VirusTotal.
Step 2: Data cleaning:
1. Remove duplicated files.
2. Verify with VirusTotal's API.

Source                       Number of files
Crawl from download pages    10899
Windows 7                    4804
Windows 8                    7768
Windows 10                   8394
VirusTotal                   44984

Benign   Malware
31865    44984
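The slides do not say how duplicates were identified. One common approach, assumed here, is to compare content hashes; the sketch below keeps the first file seen for each SHA-256 digest. The in-memory `files` dictionary and its file names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(files: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first file seen for each distinct content hash."""
    seen = set()
    unique = {}
    for name, data in files.items():
        digest = sha256_of(data)
        if digest not in seen:
            seen.add(digest)
            unique[name] = data
    return unique


# Hypothetical file contents keyed by file name.
files = {
    "a.exe": b"\x4d\x5a\x01",
    "copy_of_a.exe": b"\x4d\x5a\x01",
    "b.exe": b"\x4d\x5a\x02",
}
unique = deduplicate(files)
print(list(unique))
```

The same digest can also serve as the lookup key when a file's label is checked against an external service.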
Step 3: Extract features:
1. Disassemble the files.
2. Calculate opcode frequencies.
• Features matrix: 63730 files × 1230 opcodes

mov    push   ...   xor    and
120    150    ...   100    30
065    12     ...   239    123
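Counting opcodes could look like the sketch below. It assumes the disassembler (not named in the slides) has already produced a text listing with one instruction per line and the mnemonic as the first token; the listing and the four-opcode vocabulary are made up for illustration.

```python
from collections import Counter


def opcode_counts(listing: str) -> Counter:
    """Count the mnemonic (first token) of each non-empty line."""
    counts = Counter()
    for line in listing.splitlines():
        tokens = line.split()
        if tokens:
            counts[tokens[0].lower()] += 1
    return counts


def feature_row(counts: Counter, vocabulary: list[str]) -> list[int]:
    """One row of the features matrix: a count per opcode in a fixed order."""
    return [counts[op] for op in vocabulary]


# Hypothetical disassembly listing.
listing = """
push ebp
mov ebp, esp
xor eax, eax
mov eax, 1
call 0x401000
"""
vocabulary = ["mov", "push", "xor", "and"]
row = feature_row(opcode_counts(listing), vocabulary)
print(row)
```

Stacking one such row per file, over the full opcode vocabulary, gives a files × opcodes matrix of the shape described above.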
Step 4: Data preprocessing:
1. Variance threshold (0.1).
2. Calculate opcode percentage.
3. Remove NaNs.
4. Standardize features.
• Features matrix: 50388 files × 681 opcodes
• Reduces the number of features by ~45%
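The four preprocessing steps can be sketched on a tiny count matrix. Several details are assumptions rather than facts from the slides: that the threshold is applied to raw counts, that NaNs come from files with no remaining opcodes, and that whole rows are dropped when they contain a NaN. The helper names and the sample matrix are hypothetical.

```python
import math
from statistics import mean, pstdev, pvariance


def variance_threshold(rows, threshold=0.1):
    """Keep only the columns whose population variance exceeds the threshold."""
    keep = [j for j, col in enumerate(zip(*rows)) if pvariance(col) > threshold]
    return [[row[j] for j in keep] for row in rows], keep


def to_percentages(rows):
    """Divide each count by its row total; an all-zero row becomes NaN."""
    out = []
    for row in rows:
        total = sum(row)
        out.append([c / total if total else math.nan for c in row])
    return out


def drop_nan_rows(rows):
    """Remove every row that contains at least one NaN."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


# Hypothetical opcode counts: 4 files x 4 opcodes.
counts = [
    [120, 150, 7, 30],
    [65, 12, 7, 123],
    [0, 0, 7, 0],
    [10, 40, 7, 50],
]
reduced, keep = variance_threshold(counts, threshold=0.1)
features = standardize(drop_nan_rows(to_percentages(reduced)))
print(keep, len(features))
```

In this toy matrix the constant third column is removed by the variance threshold, which leaves the third file with no opcodes at all, so its row turns into NaNs and is dropped.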
Step 5: Training:
1. Split train-test data:
• Train: (45349, 681)
• Test: (5039, 681)
2. Try with algorithms:
• Random forest
• SVM
• Linear regression
• Neural network (9 layers)
[Figures: Random forest; Neural network]
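The train and test shapes above are consistent with holding out roughly 10% of the 50388 files. The sketch below reproduces those sizes; the 10% fraction is inferred from the shapes, and the shuffling, seed, and rounding rule are assumptions.

```python
import random


def train_test_split(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split(50388)
print(len(train_idx), len(test_idx))
```

Splitting indices rather than rows keeps the features matrix and the label vector aligned when both are sliced with the same index lists.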
Step 6: Evaluate models:
• Test set: ~5000 files:
• ~2900 malware
• ~2100 benign

Algorithm                          Precision          Recall
Random Forest (Machine Learning)   98% (1996/2037)    97% (2037/2100)
Deep learning (DNN, 9 layers)      96% (1955/2037)    97% (2037/2100)
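For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below defines both and checks that the fractions in the table round to the printed percentages. The slides do not state how each numerator and denominator maps onto a confusion matrix, so the fractions are used exactly as printed; the dictionary keys are labels invented for the example.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are true positives."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def as_percent(numerator: int, denominator: int) -> int:
    """Fraction expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


# Fractions copied from the results table.
printed = {
    "Random Forest precision": (1996, 2037),
    "Random Forest recall": (2037, 2100),
    "DNN precision": (1955, 2037),
    "DNN recall": (2037, 2100),
}
percentages = {name: as_percent(n, d) for name, (n, d) in printed.items()}
print(percentages)
```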
4. Demo
5. Conclusion
1. Malware continues to grow in volume and evolve in complexity.
2. Traditional approaches are less effective at detecting new malware.
3. There is a lot of research using ML & DL to detect malware.
4. Industry is trying to apply it in real-world products.
Internet Shield
Advanced Threat Detection
Web, Email, DNS, EDR
EDR integrated with Threat Intelligence Platform
6. Q&A
