Problems
                  Solution
        Model Illustration




    Ensemble Based
   Categorization and
Adaptive Learning Model
 for Malware Detection
     Muhammad Najmi bin Ahmad Zabidi
       najmi.zabidi@gmail.com

      IAS 2011, Universiti Teknikal Melaka (UTEM)


              6th December 2011
        Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   1/25
Problems
                             Solution
                   Model Illustration




About



  • Phd student at Universiti Teknologi Malaysia, Skudai
  • Employed by International Islamic University Malaysia,
    Gombak




                   Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   2/25
Problems
                             Solution
                   Model Illustration




Overview


  • Malware detection is considered
    ‘‘undecidable’’[Cohen, 1986]
  • Means 100 percent detection for all time is impossible
  • But there’s still room for highest detection accuracy




                   Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   3/25
Problems
                              Solution
                    Model Illustration




Problem 1 - Features

   • Malware detection depends on features to generate
     signatures
   • Some features could be redundant, hence computation
     time is more expensive
   • Features could be weak, not relevant
   • There is possibility that strong features are enough, and
     discard the weaker ones
   • This, could be reduce by dimesion reduction method




                    Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   4/25
Problems
                              Solution
                    Model Illustration




Problem 2 - Classification of
Software


   • Classification here refers to classification between
     malicious, suspicious and benign software
   • Tackling the problem of false positive, false negative and
     increase precision




                    Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   5/25
Problems
                             Solution
                   Model Illustration




Problem 3 - Tackling new malware



   • Unknown malware is the problem
   • No prior knowledge
   • Suggesting unsupervised categorization




                   Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   6/25
Problems
                                Solution
                      Model Illustration




Related works on malware detection

 Statistical based:
   • [Chouchane et al., 2007, Saudi et al., 2010,
     Merkel et al., 2010]
 Data mining and machine learning:
   • [Sun et al., 2010, Komashinskiy and Kotenko, 2009,
     Komashinskiy and Kotenko, 2010]
   • [Elovici et al., 2007, Gavrilut et al., 2009,
     Firdausi et al., 2010, Golovko et al., 2010]




                      Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   7/25
Problems
                                Solution
                      Model Illustration




Solutions



  Feature Selection
    • Use feature selection to reduce processing overhead




                      Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   8/25
Problems
                             Solution
                   Model Illustration


Categorization and Ensemble

 • Use generic classifier at first to segregate malware and
   non malware
 • Use specific classifier secondly to segregate special traits
   of malware (trojan, worm, virus)
 • Supervised categorization is needed, to classify known
   malware features
 • In recent literatures, the term semi-supervised learning is
   coined to represent the ‘‘assisted’’ unsupervised
   categorization
 • Ensemble classification helps, since base weak learner
   could be boosted
 • Unsupervised categorization (clustering) needed, to
   categorization unknown malware

                  Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   9/25
Problems
                            Solution
                  Model Illustration




Adaptive Learning

  • Use adaptive learning hence the new malware which
   previously unknown can be taught as known, hence will
   be discarded at early phase




                 Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   10/25
Problems     Phase 1
                     Solution   Phase 2
           Model Illustration   Phase 3
                                Phase 4



Suggestion of Model




           Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   11/25
Problems     Phase 1
                    Solution   Phase 2
          Model Illustration   Phase 3
                               Phase 4



Phase 1




          Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   12/25
Problems     Phase 1
                              Solution   Phase 2
                    Model Illustration   Phase 3
                                         Phase 4



P1 descriptions



   • Preprocessing work includes ripping API calls, or any
     other useful information from the malware binaries
   • The process of feature selection is being done here




                    Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   13/25
Problems     Phase 1
                              Solution   Phase 2
                    Model Illustration   Phase 3
                                         Phase 4



Features


   • Features, in this case is API calls:
       • The less API calls could be used, the better
       • Dimension reduction method is being used to handle this
   • Future work, we considering adding entropy analysis of
     packed binary body, apart from the API calls profiling




                    Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   14/25
Problems     Phase 1
                             Solution   Phase 2
                   Model Illustration   Phase 3
                                        Phase 4



Interesting API calls
  CreateMutex,
  NtasdfCreateFile
  call shell32
  advapi32.RegOpenKey
  KERNEL32.CreateProcess,
  shdocvw,
  gethostbyname,
  advapi32.RegCreate,
  advapi32.RegSet
  http://
  OutputDebugString
  FindWindow
  IsDebuggerPresent

                  Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   15/25
Problems     Phase 1
                    Solution   Phase 2
          Model Illustration   Phase 3
                               Phase 4



Phase 2




          Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   16/25
Problems     Phase 1
                              Solution   Phase 2
                    Model Illustration   Phase 3
                                         Phase 4



P2 Descriptions


   • Malware being categorized according to common traits
     of generic malware
   • Next, specific symptom according to the classes of
     malware (worm, trojan, virus) being done
   • Malware could have all the packages together, but
     usually there is dominant feature




                   Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   17/25
Problems     Phase 1
                    Solution   Phase 2
          Model Illustration   Phase 3
                               Phase 4



Phase 3




          Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   18/25
Problems     Phase 1
                               Solution   Phase 2
                     Model Illustration   Phase 3
                                          Phase 4



P3 Descriptions


   • Use ensemble based classification, using weak learners
   • Many weak learners, via voting could represent more
     accurate results
   • If there is unknown class, it will go into into clustering
     phase




                     Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   19/25
Problems     Phase 1
          Solution   Phase 2
Model Illustration   Phase 3
                     Phase 4




Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   20/25
Problems     Phase 1
                                Solution   Phase 2
                      Model Illustration   Phase 3
                                           Phase 4



P4 Descriptions


   • A signature being created, if the malware is new
   • The new signature will be added to the current
     categorization
   • This will minimize the next detection cycle for the next
     malware




                      Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   21/25
Problems     Phase 1
                              Solution   Phase 2
                    Model Illustration   Phase 3
                                         Phase 4



The Dataset

 In malware research, there is no standard dataset, unlike
 Intrusion Detection area which usually relied on KDD/MIT
 Lincoln datasets.
   • We obtain malware samples from
     CyberSecurityMalaysia(CSM), consists of 2GB malware
     files, amounted around 30,000 malware binaries
   • We have to build our own dataset to extract the features
   • This, considered preprocessing work




                    Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   22/25
Problems     Phase 1
                                Solution   Phase 2
                      Model Illustration   Phase 3
                                           Phase 4



Conclusion

   • Soft computing approach could assist in malware
     detection
   • Feature selection could assist in minimizing feature
     processing
   • Ensemble methods could help in increasing malware
     categorization
   • Adaptive learning could help in avoiding redundant
     retraining for the n next iteration




                      Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   23/25
Problems     Phase 1
          Solution   Phase 2
Model Illustration   Phase 3
                     Phase 4




Muhammad Najmi       Information Assurance and Security Conf (IAS 2011)   24/25
Problems     Phase 1
                             Solution   Phase 2
                   Model Illustration   Phase 3
                                        Phase 4



Bibliography
   Chouchane, M. R., Walenstein, A., and Lakhotia, A. (2007).
   Statistical signatures for fast filtering of
   instruction-substituting metamorphic malware.
   In Proceedings of the 2007 ACM workshop on Recurring
   malcode, WORM ’07, pages 31--37, New York, NY, USA.
   ACM.
   Cohen, F. B. (1986).
   Computer viruses.
   PhD thesis, Los Angeles, CA, USA.
   AAI0559804.
   Elovici, Y., Shabtai, A., Moskovitch, R., Tahan, G., and
   Glezer, C. (2007).
   Applying machine learning techniques for detection of
   malicious code in network traffic.
                  Muhammad Najmi        Information Assurance and Security Conf (IAS 2011)   25/25

Ensembled Based Categorization and Adaptive Learning Model for Malware Detection

  • 1.
    Problems Solution Model Illustration Ensemble Based Categorization and Adaptive Learning Model for Malware Detection Muhammad Najmi bin Ahmad Zabidi najmi.zabidi@gmail.com IAS 2011, Universiti Teknikal Melaka (UTEM) 6th December 2011 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 1/25
  • 2.
    Problems Solution Model Illustration About • Phd student at Universiti Teknologi Malaysia, Skudai • Employed by International Islamic University Malaysia, Gombak Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 2/25
  • 3.
    Problems Solution Model Illustration Overview • Malware detection is considered ‘‘undecidable’’[Cohen, 1986] • Means 100 percent detection for all time is impossible • But there’s still room for highest detection accuracy Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 3/25
  • 4.
    Problems Solution Model Illustration Problem 1 - Features • Malware detection depends on features to generate signatures • Some features could be redundant, hence computation time is more expensive • Features could be weak, not relevant • There is possibility that strong features are enough, and discard the weaker ones • This, could be reduce by dimesion reduction method Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 4/25
  • 5.
    Problems Solution Model Illustration Problem 2 - Classification of Software • Classification here refers to classification between malicious, suspicious and benign software • Tackling the problem of false positive, false negative and increase precision Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 5/25
  • 6.
    Problems Solution Model Illustration Problem 3 - Tackling new malware • Unknown malware is the problem • No prior knowledge • Suggesting unsupervised categorization Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 6/25
  • 7.
    Problems Solution Model Illustration Related works on malware detection Statistical based: • [Chouchane et al., 2007, Saudi et al., 2010, Merkel et al., 2010] Data mining and machine learning: • [Sun et al., 2010, Komashinskiy and Kotenko, 2009, Komashinskiy and Kotenko, 2010] • [Elovici et al., 2007, Gavrilut et al., 2009, Firdausi et al., 2010, Golovko et al., 2010] Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 7/25
  • 8.
    Problems Solution Model Illustration Solutions Feature Selection • Use feature selection to reduce processing overhead Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 8/25
  • 9.
    Problems Solution Model Illustration Categorization and Ensemble • Use generic classifier at first to segregate malware and non malware • Use specific classifier secondly to segregate special traits of malware (trojan, worm, virus) • Supervised categorization is needed, to classify known malware features • In recent literatures, the term semi-supervised learning is coined to represent the ‘‘assisted’’ unsupervised categorization • Ensemble classification helps, since base weak learner could be boosted • Unsupervised categorization (clustering) needed, to categorization unknown malware Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 9/25
  • 10.
    Problems Solution Model Illustration Adaptive Learning • Use adaptive learning hence the new malware which previously unknown can be taught as known, hence will be discarded at early phase Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 10/25
  • 11.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Suggestion of Model Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 11/25
  • 12.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Phase 1 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 12/25
  • 13.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 P1 descriptions • Preprocessing work includes ripping API calls, or any other useful information from the malware binaries • The process of feature selection is being done here Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 13/25
  • 14.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Features • Features, in this case is API calls: • The less API calls could be used, the better • Dimension reduction method is being used to handle this • Future work, we considering adding entropy analysis of packed binary body, apart from the API calls profiling Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 14/25
  • 15.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Interesting API calls CreateMutex, NtasdfCreateFile call shell32 advapi32.RegOpenKey KERNEL32.CreateProcess, shdocvw, gethostbyname, advapi32.RegCreate, advapi32.RegSet http:// OutputDebugString FindWindow IsDebuggerPresent Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 15/25
  • 16.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Phase 2 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 16/25
  • 17.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 P2 Descriptions • Malware being categorized according to common traits of generic malware • Next, specific symptom according to the classes of malware (worm, trojan, virus) being done • Malware could have all the packages together, but usually there is dominant feature Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 17/25
  • 18.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Phase 3 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 18/25
  • 19.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 P3 Descriptions • Use ensemble based classification, using weak learners • Many weak learners, via voting could represent more accurate results • If there is unknown class, it will go into into clustering phase Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 19/25
  • 20.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 20/25
  • 21.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 P4 Descriptions • A signature being created, if the malware is new • The new signature will be added to the current categorization • This will minimize the next detection cycle for the next malware Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 21/25
  • 22.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 The Dataset In malware research, there is no standard dataset, unlike Intrusion Detection area which usually relied on KDD/MIT Lincoln datasets. • We obtain malware samples from CyberSecurityMalaysia(CSM), consists of 2GB malware files, amounted around 30,000 malware binaries • We have to build our own dataset to extract the features • This, considered preprocessing work Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 22/25
  • 23.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Conclusion • Soft computing approach could assist in malware detection • Feature selection could assist in minimizing feature processing • Ensemble methods could help in increasing malware categorization • Adaptive learning could help in avoiding redundant retraining for the n next iteration Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 23/25
  • 24.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 24/25
  • 25.
    Problems Phase 1 Solution Phase 2 Model Illustration Phase 3 Phase 4 Bibliography Chouchane, M. R., Walenstein, A., and Lakhotia, A. (2007). Statistical signatures for fast filtering of instruction-substituting metamorphic malware. In Proceedings of the 2007 ACM workshop on Recurring malcode, WORM ’07, pages 31--37, New York, NY, USA. ACM. Cohen, F. B. (1986). Computer viruses. PhD thesis, Los Angeles, CA, USA. AAI0559804. Elovici, Y., Shabtai, A., Moskovitch, R., Tahan, G., and Glezer, C. (2007). Applying machine learning techniques for detection of malicious code in network traffic. Muhammad Najmi Information Assurance and Security Conf (IAS 2011) 25/25