NanoSec Conference 2019: Malware Classification Using Deep Learning - Mohd Shahril

Talk was presented by Mohd Shahril at NanoSec Conference 2019, InterContinental Hotel Kuala Lumpur on the 9th of October 2019.

1. 1. Malware Classification using Deep Learning Mohd Shahril
2. 2. # whoami ● Mohd Shahril (@mohd_shahril_96) ● CS graduate in AI :) ● Interested in Information Security and Artificial Intelligence ● Software Developer at Pernec Integrated Network Systems (PINS) ● CTF player with local AleJnd team ● Wargames.my Crew
3. 3. What is this talk about? ● My research ○ Part of my bachelor's degree final-year project ● How does Deep Learning work? ● How to leverage Deep Learning for the malware classification task
4. 4. # 0x0: What is Deep Learning?
5. 5. # 0x0: What is Deep Learning?
6. 6. # 0x0: What is Deep Learning?
7. 7. # 0x0: What is Deep Learning? To be less dependent on expert Digitalogy
8. 8. # 0x0: What is Deep Learning? To be less dependent on expert Good pattern searcher Digitalogy
9. 9. # 0x0: What is Deep Learning? E.g. Support Vector Machine: find a hyperplane that separates two regions of data
10. 10. # 0x0: What is Deep Learning? To be less dependent on expert (Very!) Good pattern searcher Good pattern searcher Digitalogy
11. 11. # 0x0: What is Deep Learning? ● Based on Artificial Neural Networks ● Inspired by biological neural networks (the brain)
12. 12. # 0x0: What is Deep Learning? Artificial Neural Network Input Layer Output Layer Hidden Layer
13. 13. # 0x0: What is Deep Learning? Artificial Neural Network Weight: each of these lines has a real-number value Input Layer Output Layer Hidden Layer
14. 14. # 0x0: What is Deep Learning? Weight: each of these lines has a real-number value Neuron: each neuron represents a non-linear function of the sum of its inputs Artificial Neural Network Input Layer Output Layer Hidden Layer
15. 15. # 0x0: What is Deep Learning? ● A neural network is a collection of linear combinations ○ Each neuron sums its weighted inputs ○ Applies a non-linear transformation (activation function) ○ Forwards the output value to the next neuron
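The three steps on this slide can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation, the bias term, and all the numbers are assumptions, not something the slides specify.

```python
import math

def neuron(inputs, weights, bias):
    """Sum the weighted inputs, then apply a non-linear activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid (assumed)

def layer(inputs, weight_rows, biases):
    """One layer: every neuron sees the same inputs but has its own weights."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical 2-input network with one hidden layer of 2 neurons.
hidden = layer([1.0, 0.5], [[0.4, -0.2], [0.3, 0.8]], [0.0, -0.1])
output = neuron(hidden, [1.2, -0.7], 0.05)  # hidden outputs feed the next neuron
print(hidden, output)
```

Every number inside the weight lists is one of the "lines" from the diagram; training is what changes them.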
16. 16. # 0x0: What is Deep Learning? ● How clever a neural network is depends on its weight values ○ p/s: weight value = the real number on each of the lines ● Training a neural network means adjusting its weight values ○ Backpropagation is a well-known algorithm for training neural networks
17. 17. # 0x0: What is Deep Learning? ● Based on some meth (**math) ● Has the ability to learn patterns from data ○ Just another machine learning algorithm, just… more powerful ● Powerful means it can learn very complex patterns from data
18. 18. # 0x0: What is Deep Learning?
19. 19. # 0x0: What is Deep Learning? ● When people say deep learning, they basically mean deep neural networks. ● So, what is a deep neural network?
20. 20. # 0x0: What is Deep Learning? (normal) Neural Network Add this moar plz?
21. 21. # 0x0: What is Deep Learning? (normal) Neural Network (deep) Neural Network Add this moar plz?
22. 22. # 0x0: What is Deep Learning? ● A deeper network means low-level details can be captured
23. 23. # 0x0: What is Deep Learning? ● A variety of deep learning architectures have been developed ● They have solved different kinds of problems ○ Computer vision (human detection from CCTV in real time) ○ Speech recognition (YouTube’s auto-captions) ○ Essay generator (yes, we have that) ○ etc. ● These problems were hard to solve before deep learning came along
24. 24. # 0x1: Training DL to Classify Malware ● How can we do this? ● DL is known as a great pattern searcher ○ Can differentiate between different classes of data with high accuracy
25. 25. # 0x1: Training DL to Classify Malware ● Scope: ○ Problem: given a new suspected malware executable, predict which malware family it belongs to ○ Focus on these malware families: ■ Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt ○ Focus on Windows malware executables
26. 26. # 0x1: Training DL to Classify Malware ● A malware executable is just another Windows executable
27. 27. # 0x1: Training DL to Classify Malware ● How can we know if an executable is doing something malicious? ○ psst: it does naughty things when it executes 😣 ● The idea is to capture its runtime behavior while it is executing ● Run the executable inside a sandbox and collect its behavior data ○ Using Cuckoo Sandbox ○ Bypasses common malware “protections”, such as packers and mutation ■ Can’t achieve that by relying on static data alone
28. 28. # 0x1: Training DL to Classify Malware ● Data collection (.exe) using VirusTotal ● For each malware family in scope, get 1,000 executables ○ Total of 6,000 .exe ● Run each sample in Cuckoo Sandbox and collect its behavior logs (in JSON)
29. 29. # 0x1: Training DL to Classify Malware
30. 30. # 0x1: Training DL to Classify Malware ● Problem: Deep Learning requires its input in a fixed-length format ○ Each malware sample behaves differently, so the length of its behavior data differs ● Idea: Convert the behavior data into a format that DL can understand
31. 31. # 0x1: Training DL to Classify Malware ● Used a Natural Language Processing (NLP) technique ○ Based on 1-gram extraction ○ Split every JSON file into words ○ Count the occurrences of each word across the JSON files ○ Collect the 10,000 most frequent words ○ Map each JSON file to a binary vector (whether each of the most frequent words exists in it or not)
32. 32. # 0x1: Training DL to Classify Malware [“system”, “sections”, “80386”, “Win32”] Top 1-gram Mapper Sample 1’s 1-gram [1, 0, 1, 1] [“service”, “mutex”, “shellcmds”, “http”] Top 1-gram Mapper Sample 2’s 1-gram [0, 0, 1, 0]
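A minimal sketch of the 1-gram mapping described on the last two slides, using only the standard library. The tokenising regex, the function names, and the sample report are assumptions made for illustration; the slides do not say how the JSON was split into words.

```python
import json
import re
from collections import Counter

def tokens(report):
    """Split a parsed JSON behavior report into word tokens (1-grams)."""
    return re.findall(r"[A-Za-z0-9_]+", json.dumps(report))

def top_words(reports, k):
    """Return the k most frequent words across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokens(report))
    return [word for word, _ in counts.most_common(k)]

def encode(report, vocabulary):
    """Map a report to a fixed-length 0/1 vector: is each top word present?"""
    present = set(tokens(report))
    return [1 if word in present else 0 for word in vocabulary]

# Vocabulary taken from the slide; the report itself is hypothetical.
vocabulary = ["system", "sections", "80386", "Win32"]
sample = {"info": "system", "arch": "80386", "platform": "Win32"}
print(encode(sample, vocabulary))  # [1, 0, 1, 1]
```

With `k = 10000`, `top_words` would produce the vocabulary and `encode` the 10,000-bit string for each sample, whatever the length of the original log.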
33. 33. # 0x1: Training DL to Classify Malware Malware Dataset → Cuckoo Sandbox → Malware Behaviors → 1-gram extraction → Fixed-size binary string. Each malware's behavior is now encoded as a fixed-size 10,000-bit binary string
34. 34. # 0x1: Training DL to Classify Malware ● Problem: a fixed-size input of 10,000 values is still large for Deep Learning training ● Curse of dimensionality ● Idea: Apply dimensionality reduction to the binary data ○ Transform the 10,000 binary values into 20 real numbers ○ Used Deep Autoencoders ■ A special DL architecture for non-linear dimensionality reduction High-Dimension → Low-Dimension
35. 35. # 0x1: Training DL to Classify Malware Deep Autoencoders: Original Input [10,000] → [3,000] → [500] → [100] → [20] (Encoder Layer) → [100] → [500] → [3,000] → [10,000] Original Input (Decoder Layer)
36. 36. # 0x1: Training DL to Classify Malware Fixed-size binary string → Deep Autoencoders → 20 real numbers
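To get a feel for the size of this autoencoder, the layer widths from the slide can be mirrored and the trainable parameters counted. This is a structural sketch only, with no training; it assumes fully connected layers with one bias per neuron, which the slides do not state.

```python
ENCODER_SIZES = [10_000, 3_000, 500, 100, 20]  # widths from the slide

def autoencoder_sizes(encoder_sizes):
    """Mirror the encoder widths to obtain the full encoder + decoder stack."""
    return encoder_sizes + encoder_sizes[-2::-1]

def dense_layers(sizes):
    """Pair each layer width with the next: (inputs, outputs) per dense layer."""
    return list(zip(sizes, sizes[1:]))

def parameter_count(sizes):
    """Weights plus biases, assuming fully connected layers (an assumption)."""
    return sum(n_in * n_out + n_out for n_in, n_out in dense_layers(sizes))

sizes = autoencoder_sizes(ENCODER_SIZES)
print(sizes)                   # [10000, 3000, 500, 100, 20, 100, 500, 3000, 10000]
print(parameter_count(sizes))  # 63121220
```

After training, only the encoder half is needed: it turns each 10,000-bit string into the 20 real numbers at the bottleneck.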
37. 37. # 0x1: Training DL to Classify Malware
38. 38. # 0x1: Training DL to Classify Malware ● Now comes the fun part: training a Deep Neural Network to classify malware ● Before training the network, the dataset has to be split ○ 70% goes to the training set (4,200 samples) ○ 30% goes to the validation set (1,800 samples) ● The split lets us observe how well the network predicts unseen data
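The 70/30 split works out as follows. The shuffling and the fixed seed are assumptions for the sake of a reproducible sketch; the slides do not say how the samples were ordered or whether the split was stratified per family.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples, then cut them into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 6 families x 1,000 executables = 6,000 samples, represented here by their index.
training_set, validation_set = split_dataset(range(6000))
print(len(training_set), len(validation_set))  # 4200 1800
```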
39. 39. # 0x1: Training DL to Classify Malware ● Used a relatively simple Deep Neural Network (DNN) architecture: [20] → [60] → [200] → [40] → [15] → [6] ○ Input: 20 real numbers ○ Output: class probability for each malware family
40. 40. # 0x1: Training DL to Classify Malware Output Layer: Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt. The network outputs a probability for each malware family. Example: Cerber 0.00, Cryptowall 0.97, GandCrab 0.02, Petya 0.003, Sality 0.007, Wannacrypt 0.0 (Total = 1.0)
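One common way to get six probabilities that add up to 1.0 is a softmax over the six raw output values. The slides do not name the output activation, so softmax is an assumption here, and the raw scores below are made up.

```python
import math

FAMILIES = ["Cerber", "Cryptowall", "GandCrab", "Petya", "Sality", "Wannacrypt"]

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    highest = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the most probable family together with all class probabilities."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return FAMILIES[best], probabilities

# Hypothetical raw scores from the six output neurons.
family, probabilities = predict([0.1, 5.0, 1.0, 0.0, 0.2, -1.0])
print(family)  # Cryptowall
```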
41. 41. Pipeline: 1) Malware Executables → Cuckoo Sandbox 2) Pre-processing: i) Transform samples to 1-grams ii) Fetch the most frequent 1-grams iii) Map each sample’s 1-grams against the top 1-grams → Bit-String 3) Deep Autoencoders → Transformed Bit-String 4) Split Dataset: Training Set (70%), Validation Set (30%) 5) Training / Evaluate DL: Deep Neural Network, Validate for Accuracy
42. 42. # 0x1: Training DL to Classify Malware ● Well, that’s it, folks 😃 ● Two deep networks need to be trained ○ Deep Autoencoders (for dimensionality reduction) ○ Deep Neural Network (for malware prediction) ● Accuracy = 96.3% on unseen data
43. 43. # 0x2 Demo
44. 44. # 0x3: Problems ● If the given executable is non-malicious, the network will still predict that it belongs to one of these malware families ● The same problem exists if we try to classify malware that does not belong to the original six families ● Two solutions that I can think of: a. If no predicted family has a probability greater than 0.95, assume the network can’t classify the executable b. Create another class, “others”, fill it with outside samples, and train on it together with the six families
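Solution (a) is a simple reject rule on top of the output probabilities. A minimal sketch follows; the function name and the dictionary format are assumptions, and the first example reuses the numbers from the output-layer slide.

```python
def classify_with_reject(probabilities, threshold=0.95):
    """Return the predicted family, or None if no class exceeds the threshold."""
    family = max(probabilities, key=probabilities.get)
    if probabilities[family] > threshold:
        return family
    return None  # the network "can't predict" this executable

confident = {"Cerber": 0.00, "Cryptowall": 0.97, "GandCrab": 0.02,
             "Petya": 0.003, "Sality": 0.007, "Wannacrypt": 0.0}
unsure = {name: 1 / 6 for name in confident}  # hypothetical flat output

print(classify_with_reject(confident))  # Cryptowall
print(classify_with_reject(unsure))     # None
```

The 0.95 cut-off trades coverage for safety: raising it rejects more benign or unknown files, but also more genuine samples from the six families.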
45. 45. # 0x3: Problems ● This method also has one major flaw: it can’t be used for runtime malware detection ○ Its reliance on the sandbox delays the prediction process ○ The malicious payload has likely already been delivered by the time it is detected ● See the paper “Early-stage malware prediction using recurrent neural networks” ○ Only captures the first 5 seconds of runtime behavior ○ Claims to achieve 94% accuracy
46. 46. # 0x3: Problems ● Vulnerable to adversarial attacks ○ Generate malware samples that can fool the network ● This attacks the nature of the neural network itself ○ How to defend against it is still ongoing research ● Theoretical solution: ○ Generate a lot of adversarial samples and train the network together with those
47. 47. # 0x4 Happy Hacking! 😃 https://github.com/shahril96/Malware-Classification-using-Deep-Learning