
NanoSec Conference 2019: Malware Classification Using Deep Learning - Mohd Shahril

Talk was presented by Mohd Shahril at NanoSec Conference 2019, InterContinental Hotel Kuala Lumpur on the 9th of October 2019.

  1. 1. Malware Classification using Deep Learning Mohd Shahril
  2. 2. # whoami ● Mohd Shahril (@mohd_shahril_96) ● CS graduate in AI :) ● Interested in Information Security and Artificial Intelligence ● Software Developer at Pernec Integrated Network Systems (PINS) ● CTF player with the local AleJnd team ● Wargames.my Crew
  3. 3. What is this talk about? ● My research ○ Part of my bachelor's degree final-year project ● How does Deep Learning work? ● How to leverage Deep Learning for a malware classification task
  4. 4. # 0x0: What is Deep Learning?
  5. 5. # 0x0: What is Deep Learning?
  6. 6. # 0x0: What is Deep Learning?
  7. 7. # 0x0: What is Deep Learning? To be less dependent on experts Digitalogy
  8. 8. # 0x0: What is Deep Learning? To be less dependent on experts Good pattern searcher Digitalogy
  9. 9. # 0x0: What is Deep Learning? E.g. Support Vector Machine: finds a hyperplane that separates two regions of data
  10. 10. # 0x0: What is Deep Learning? To be less dependent on experts (Very!) Good pattern searcher Good pattern searcher Digitalogy
  11. 11. # 0x0: What is Deep Learning? ● Based on Artificial Neural Networks ● Inspired by biological neural networks (the brain)
  12. 12. # 0x0: What is Deep Learning? Artificial Neural Network Input Layer Output Layer Hidden Layer
  13. 13. # 0x0: What is Deep Learning? Artificial Neural Network Weight: each of these lines has a real-number value Input Layer Output Layer Hidden Layer
  14. 14. # 0x0: What is Deep Learning? Weight: each of these lines has a real-number value Neuron: each neuron represents a non-linear function of the sum of its inputs Artificial Neural Network Input Layer Output Layer Hidden Layer
  15. 15. # 0x0: What is Deep Learning? ● A neural network is a collection of linear combinations ○ Each neuron sums all of its weighted inputs ○ Applies a non-linear transformation (activation function) ○ Forwards the output value to the next neuron
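The three steps on this slide (weighted sum, activation, forward) can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation, the bias term, and all weight values are assumptions, since the slides do not say which ones the talk used.

```python
import math

def sigmoid(x):
    # A common non-linear activation; assumed here, not taken from the talk.
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Step 1: sum the weighted inputs. Step 2: apply the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    # One output value per neuron; this list is what gets forwarded (step 3).
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Tiny made-up network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = layer([1.0, 0.0], [[0.5, -0.5], [0.0, 0.0]], [0.0, 0.0])
output = layer(hidden, [[1.0, -1.0]], [0.0])
print(hidden, output)
```

Each call to `layer` consumes the previous layer's output, which is all that "forwarding" means here.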
  16. 16. # 0x0: What is Deep Learning? ● The cleverness of a neural network is determined by its weight values ○ p/s: weight value = the real number on each of the lines ● Training a neural network means adjusting its weight values ○ Backpropagation is a well-known algorithm for training neural networks
  17. 17. # 0x0: What is Deep Learning? ● Based on some meth (**math) ● Has the ability to learn patterns from data ○ Just another machine learning algorithm, just… more powerful ● Powerful means it can learn very complex patterns from data
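To make "training means adjusting the weights" concrete, here is a minimal sketch that trains a single sigmoid neuron on the logical OR function. With only one neuron this is plain gradient descent via the chain rule, which is the one-layer special case of backpropagation, not the full multi-layer algorithm. The data, learning rate, and epoch count are all made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy dataset (logical OR), purely illustrative.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

def predict(weights, bias, xs):
    return sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)

def mean_squared_error(weights, bias):
    return sum((predict(weights, bias, xs) - y) ** 2 for xs, y in DATA) / len(DATA)

def train(epochs=2000, learning_rate=1.0):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for xs, y in DATA:
            out = predict(weights, bias, xs)
            # Chain rule for E = 0.5 * (out - y)^2 through the sigmoid.
            delta = (out - y) * out * (1.0 - out)
            # Nudge every weight against its gradient.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, xs)]
            bias -= learning_rate * delta
    return weights, bias

loss_before = mean_squared_error([0.0, 0.0], 0.0)
weights, bias = train()
loss_after = mean_squared_error(weights, bias)
print(loss_before, loss_after)
```

The only thing that changes between the first and last epoch is the weight values, which is exactly the point the slide makes.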
  18. 18. # 0x0: What is Deep Learning?
  19. 19. # 0x0: What is Deep Learning? ● When people say deep learning, they basically mean deep neural networks. ● So, what is a deep neural network?
  20. 20. # 0x0: What is Deep Learning? (normal) Neural Network Add this moar plz?
  21. 21. # 0x0: What is Deep Learning? (normal) Neural Network (deep) Neural Network Add this moar plz?
  22. 22. # 0x0: What is Deep Learning? ● A deeper network means low-level details can be captured
  23. 23. # 0x0: What is Deep Learning? ● A variety of deep learning architectures have been created ● They have solved different kinds of problems ○ Computer vision (human detection from CCTV in real time) ○ Speech recognition (YouTube’s auto-captions) ○ Essay generator (yes, we have that) ○ etc. ● These problems were hard to solve before deep learning came along
  24. 24. # 0x1: Training DL to Classify Malware ● How can we do this? ● DL is known as a great pattern searcher ○ Can differentiate between different classes of data with high accuracy
  25. 25. # 0x1: Training DL to Classify Malware ● Scope: ○ Problem: Given a new suspected malware executable, predict which malware family it belongs to. ○ Focus on these malware families: ■ Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt ○ Focus on Windows malware executables
  26. 26. # 0x1: Training DL to Classify Malware ● A malware executable is just another Windows executable
  27. 27. # 0x1: Training DL to Classify Malware ● How can we know if an executable is doing something malicious? ○ psst: It does naughty things when it executes 😣 ● The idea is to capture its runtime behavior while it is executing ● Run the executable inside a sandbox and collect its behavior data ○ Using Cuckoo Sandbox ○ Bypasses common malware “protections”, such as packers and mutation ■ Can’t achieve that by relying on static data alone
  28. 28. # 0x1: Training DL to Classify Malware ● Data collection (.exe) using VirusTotal ● For each malware family in scope, get 1,000 executables ○ Total of 6,000 .exe ● Run each sample in Cuckoo Sandbox and collect its behavior logs (in JSON)
  29. 29. # 0x1: Training DL to Classify Malware
  30. 30. # 0x1: Training DL to Classify Malware ● Problem: Deep Learning requires its input to be in a fixed-length format ○ Each malware sample behaves differently, so the length of its behavior data differs ● Idea: Convert the behavior data into another format that DL can understand
  31. 31. # 0x1: Training DL to Classify Malware ● Used a Natural Language Processing (NLP) technique ○ Based on 1-gram extraction ○ Split every JSON file into words ○ Count the occurrences of each word across all JSON files ○ Collect the 10,000 most frequent words ○ Map each JSON file to a binary vector (whether each of the most frequent words exists in it or not)
  32. 32. # 0x1: Training DL to Classify Malware ● Sample 1’s 1-gram: [“system”, “sections”, “80386”, “Win32”] → Mapper (Top 1-gram) → [1, 0, 1, 1] ● Sample 2’s 1-gram: [“service”, “mutex”, “shellcmds”, “http”] → Mapper (Top 1-gram) → [0, 0, 1, 0]
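The 1-gram steps from the previous two slides can be sketched as follows. This is a rough illustration, not the code from the talk: the tokenizer regex, the function names, and the two toy "reports" are all assumptions, and a real Cuckoo report is far larger. The talk keeps the top 10,000 words; the toy example keeps 4.

```python
import re
from collections import Counter

def tokenize(report_text):
    # Assumed tokenizer: runs of letters, digits, and underscores count as words.
    return re.findall(r"[A-Za-z0-9_]+", report_text)

def top_unigrams(reports, k):
    # Count word occurrences across all reports, keep the k most frequent.
    counts = Counter()
    for text in reports:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def encode(report_text, vocabulary):
    # 1 if the vocabulary word appears in this report, else 0.
    present = set(tokenize(report_text))
    return [1 if word in present else 0 for word in vocabulary]

# Made-up stand-ins for two behavior logs.
reports = [
    '{"api": "CreateFile", "mutex": "abc"}',
    '{"api": "RegSetValue", "http": "example"}',
]
vocabulary = top_unigrams(reports, k=4)
vectors = [encode(text, vocabulary) for text in reports]
print(vocabulary, vectors)
```

Every report ends up as a vector of the same length as the vocabulary, regardless of how long the original log was, which is what solves the fixed-length input problem.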
  33. 33. # 0x1: Training DL to Classify Malware Malware Dataset → Cuckoo Sandbox → Malware Behaviors → 1-gram extraction → Fixed-size binary string ● Each malware sample’s behavior is now encoded as a fixed-size binary string of 10,000 bits
  34. 34. # 0x1: Training DL to Classify Malware ● Problem: a fixed-size input of 10,000 values is still large for Deep Learning training ● Curse of dimensionality ● Idea: Apply dimensionality reduction to the binary data ○ Transform the 10,000 binary values into 20 real numbers ○ Used Deep Autoencoders ■ A special DL architecture for non-linear dimensionality reduction High-Dimension Low-Dimensions
  35. 35. # 0x1: Training DL to Classify Malware Deep Autoencoders ● Original Input → Encoder Layer: [10,000] → [3,000] → [500] → [100] → [20] ● Decoder Layer: [20] → [100] → [500] → [3,000] → [10,000] → Original Input
  36. 36. # 0x1: Training DL to Classify Malware Fixed-size binary string → Deep Autoencoders → 20 real numbers
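To get a feel for the size of this autoencoder, the sketch below counts its trainable parameters from the layer widths on the slide. It assumes every layer is fully connected and has one bias per neuron; the slides do not state the layer type, so treat the totals as an estimate under that assumption.

```python
# Layer widths as listed on the slide, from input to reconstructed output.
LAYER_SIZES = [10_000, 3_000, 500, 100, 20, 100, 500, 3_000, 10_000]

def dense_parameter_count(sizes):
    # Assumption: fully connected layers, one weight per connection
    # plus one bias per output neuron.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total = dense_parameter_count(LAYER_SIZES)
encoder_only = dense_parameter_count(LAYER_SIZES[:5])  # 10,000 down to 20
compression_ratio = LAYER_SIZES[0] / min(LAYER_SIZES)

print(total)              # 63121220
print(encoder_only)       # 31555620
print(compression_ratio)  # 500.0
```

Under these assumptions almost all of the parameters sit in the first and last layers, and only the encoder half is needed afterwards to turn a 10,000-bit string into 20 real numbers.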
  37. 37. # 0x1: Training DL to Classify Malware
  38. 38. # 0x1: Training DL to Classify Malware ● Now comes the fun part: training a Deep Neural Network to classify malware ● Before training the network, the dataset has to be split ○ 70% goes to the training set (4,200 samples) ○ 30% goes to the validation set (1,800 samples) ● The reason for the split is to observe how well the network predicts on unseen data
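A 70/30 split like the one on this slide can be sketched as below. The function name, the fixed seed, and the integer stand-ins for samples are illustrative; whether the talk shuffled or stratified the split by family is not stated, so this sketch simply shuffles.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    # Shuffle a copy so the caller's list is untouched, then cut it once.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 6,000 placeholder samples (integers stand in for encoded malware samples).
samples = list(range(6000))
train_set, validation_set = split_dataset(samples)
print(len(train_set), len(validation_set))  # 4200 1800
```

The validation samples are never shown to the network during training, which is why accuracy on them says something about unseen data.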
  39. 39. # 0x1: Training DL to Classify Malware ● Used a relatively simple Deep Neural Network (DNN) architecture ● Layers: [20] → [60] → [200] → [40] → [15] → [6] ● Input: 20 real numbers ● Output: class probability of malware
  40. 40. # 0x1: Training DL to Classify Malware ● Output Layer: Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt ● The network outputs a probability for each malware family ● Example: Cerber 0.00, Cryptowall 0.97, GandCrab 0.02, Petya 0.003, Sality 0.007, Wannacrypt 0.0 (Total = 1.0)
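An output layer whose six values sum to 1.0 is usually produced with a softmax function. The slides do not name the function, so softmax is an assumption here, and the raw scores below are made up.

```python
import math

FAMILIES = ["Cerber", "Cryptowall", "GandCrab", "Petya", "Sality", "Wannacrypt"]

def softmax(scores):
    # Subtract the maximum first so exp() cannot overflow.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores from the last layer, one per family.
scores = [-2.0, 5.0, 1.0, -1.0, 0.0, -3.0]
probabilities = softmax(scores)
predicted = FAMILIES[probabilities.index(max(probabilities))]
print(predicted, round(sum(probabilities), 6))
```

The predicted family is simply the one with the highest probability.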
  41. 41. Malware Executables → Cuckoo Sandbox → Pre-processing: i) Transform samples to 1-gram ii) Fetch top-frequent 1-grams iii) Map samples’ 1-gram with top 1-grams → Bit-String → Deep Autoencoders → Transformed Bit-String → Split Dataset: Training Set (70%), Validation Set (30%) → Training / Evaluate DL: Deep Neural Network → Validate for Accuracy
  42. 42. # 0x1: Training DL to Classify Malware ● Well, that’s it, folks 😃 ● Two deep networks need to be trained ○ Deep Autoencoders (for dimension reduction) ○ Deep Neural Network (for malware prediction) ● Accuracy = 96.3% on unseen data
  43. 43. # 0x2 Demo
  44. 44. # 0x3: Problems ● If the given executable is non-malicious, this DL will still predict that it belongs to one of these malware families ● The same problem exists if we try to classify malware that does not belong to the original six families ● Two solutions that I can think of: a. If no predicted family probability is greater than 0.95, assume the network can’t classify the executable b. Create another class, “others”, fill it with outside samples, and train on it together with the rest
  45. 45. # 0x3: Problems ● This method also has one major flaw: it can’t be used for runtime malware detection ○ Its reliance on the sandbox delays the prediction process ○ The malicious payload has likely already been delivered by the time it is detected ● See the paper “Early-stage malware prediction using recurrent neural networks” ○ Captures only the first 5 seconds of runtime behavior ○ Claims to achieve 94% accuracy
  46. 46. # 0x3: Problems ● Vulnerable to adversarial attacks ○ Generate malware samples that can fool the network ● These attack the nature of the neural network itself ○ How to defend against them is ongoing research ● Theoretical solution: ○ Generate a lot of adversarial samples and train the network together with them
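Solution (a) amounts to a confidence threshold on the network's output. A minimal sketch, with a hypothetical function name and using `None` to mean "can't classify":

```python
def predict_family(probabilities, threshold=0.95):
    # probabilities: mapping of family name -> predicted probability.
    family, confidence = max(probabilities.items(), key=lambda item: item[1])
    # Reject the prediction unless the best probability exceeds the threshold.
    return family if confidence > threshold else None

# The example output from the earlier slide: confident, so it is accepted.
confident = {"Cerber": 0.00, "Cryptowall": 0.97, "GandCrab": 0.02,
             "Petya": 0.003, "Sality": 0.007, "Wannacrypt": 0.0}
# A made-up, undecided output: rejected.
undecided = {"Cerber": 0.40, "Cryptowall": 0.35, "GandCrab": 0.10,
             "Petya": 0.05, "Sality": 0.05, "Wannacrypt": 0.05}

print(predict_family(confident))  # Cryptowall
print(predict_family(undecided))  # None
```

The 0.95 cut-off comes from the slide; in practice it would need tuning against benign and out-of-family samples.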
  47. 47. # 0x4 Happy Hacking! 😃 https://github.com/shahril96/Malware-Classification-using-Deep-Learning
