Challenges in High Accuracy of Malware Detection

Intro
Issues
Objectives
Methodology
Conclusion

Muhammad Najmi Ahmad Zabidi
International Islamic University Malaysia

IEEE Control & System Graduate Research Colloquium 2012
Shah Alam, Malaysia

16th July 2012



About

I am a graduate research student at Universiti Teknologi
Malaysia, Skudai, Johor Bahru, Malaysia
My current employer is International Islamic University
Malaysia, Kuala Lumpur
Research area: malware detection, focusing on
Windows executables



Malware in short

Malware is software
Maliciousness is defined by the risks it exposes the user to
When the intent is unclear, the term "Potentially
Unwanted Program/Application" (PUP/PUA) is sometimes used



Methods of detection

Static analysis
For this we have developed a Python-based tool
called pi-ngaji, an open source tool for static malware
analysis
Dynamic analysis
Here we execute the malware in a Windows
environment and dump the API traces into a text file



This talk outlines several challenges in current methods of
malware detection



Analysis of strings

Important, although not foolproof
Find interesting calls first
Counts as static analysis, since the binary is not
executed



Methods to find interesting strings

Use the strings command (on *NIX systems)
Open the binary in an editor
Check the Import Address Table (IAT)
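
The first method can be approximated in a few lines of Python. This is a minimal sketch of what the strings command does (runs of printable ASCII), not the code used in pi-ngaji; the function name, the minimum length of 4, and the sample bytes are assumptions made for illustration.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return every run of printable ASCII characters of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Made-up bytes standing in for part of a binary.
sample = b"\x00\x01MZ\x90\x00GetProcAddress\x00\xff\xfeLoadLibraryA\x00ab\x00"
print(extract_strings(sample))  # ['GetProcAddress', 'LoadLibraryA']
```

Short runs such as "MZ" and "ab" are dropped by the length threshold, which is the same trade-off the strings command makes to reduce noise.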



Issues

The number of malware samples is enormous
Detection needs to be automated
Our proposal: use machine learning methods



Objectives

Reduce the number of malware API features, since
some are weak or irrelevant
and are considered "noise"
Feature selection by ranking is the chosen method



The features

The features are:
Application Programming Interface (API) calls
XOR'ed strings
Anti-virtualization/virtual machine detection
Binary entropy is also of interest
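
Binary entropy can be computed directly from byte frequencies. The sketch below uses the standard Shannon entropy in bits per byte; the function name and the two test inputs are assumptions for illustration, and no particular threshold for "suspicious" is implied by the talk.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    probabilities = (n / total for n in Counter(data).values())
    return sum(-p * math.log2(p) for p in probabilities)

low = shannon_entropy(b"\x00" * 64)        # a single repeated byte value
high = shannon_entropy(bytes(range(256)))  # every byte value exactly once
print(low, high)
```

Packed or encrypted sections tend to sit near the upper end of this range, which is why entropy is commonly used as a hint that a binary is packed.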



Binary file structure

Figure: Structure of a PE file [Pietrek, 1994]


Figure: PE components, simplified



API calls

Examples of features:
GetSystemTimeAsFileTime
SetUnhandledExceptionFilter
GetCurrentProcess
TerminateProcess
LoadLibraryExW
GetVersionExW
GetProcAddress
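
One common way to feed API names like these to a learning algorithm is a binary presence vector. The sketch below assumes that encoding (the talk does not state which encoding was used); the vocabulary is the list above with the names spelled out in full, and the observed calls are made up.

```python
# Feature vocabulary: the API names listed above (assumed order).
FEATURES = [
    "GetSystemTimeAsFileTime",
    "SetUnhandledExceptionFilter",
    "GetCurrentProcess",
    "TerminateProcess",
    "LoadLibraryExW",
    "GetVersionExW",
    "GetProcAddress",
]

def to_feature_vector(observed_calls, features=FEATURES):
    """Return 1 for each feature seen in the sample's API calls, else 0."""
    observed = set(observed_calls)
    return [1 if name in observed else 0 for name in features]

# Hypothetical calls extracted from one sample; CreateFileW is outside the vocabulary.
sample_calls = ["GetProcAddress", "TerminateProcess", "CreateFileW"]
print(to_feature_vector(sample_calls))  # [0, 0, 0, 1, 0, 0, 1]
```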



Anti Debugger/AntiVM strings

IsDebuggerPresent
VMCheck.dll



"Red Pill":"\x0f\x01\x0d\x00\x00\x00\x00\xc3",
"VirtualPc trick":"\x0f\x3f\x07\x0b",
"VMware trick":"VMXh",
"VMCheck.dll":"\x45\xC7\x00\x01",
"VMCheck.dll for VirtualPC":"\x0f\x3f\x07\x0b\xc7\x45\xfc\xff\xff\xff\xff",
"Xen":"XenVMM", # Or XenVMMXenVMM
"Bochs & QEmu CPUID Trick":"\x44\x4d\x41\x63",
"Torpig VMM Trick": "\xE8\xED\xFF\xFF\xFF\x25\x00\x00\x00\xFF"
                    "\x33\xC9\x3D\x00\x00\x00\x80\x0F\x95\xC1\x8B\xC1\xC3",
"Torpig (UPX) VMM Trick": "\x51\x51\x0F\x01\x27\x00\xC1\xFB\xB5\xD5\x35"
                          "\x02\xE2\xC3\xD1\x66\x25\x32"
                          "\xBD\x83\x7F\xB7\x4E\x3D\x06\x80\x0F\x95\xC1\x8B\xC1\xC3"

Source: ZeroWine source code
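
Signatures of this kind can be matched with a plain substring search over the file contents. The following is a minimal sketch, not the ZeroWine code: it assumes the patterns are hexadecimal byte escapes, uses only a subset of them written as Python 3 bytes, and the names VM_SIGNATURES and find_vm_tricks are hypothetical.

```python
# A subset of the signatures above, written as bytes (assumed to be hex escapes).
VM_SIGNATURES = {
    "Red Pill": b"\x0f\x01\x0d\x00\x00\x00\x00\xc3",
    "VirtualPc trick": b"\x0f\x3f\x07\x0b",
    "VMware trick": b"VMXh",
    "Xen": b"XenVMM",
}

def find_vm_tricks(data: bytes, signatures=VM_SIGNATURES):
    """Return (name, offset) for the first occurrence of each signature in data."""
    hits = []
    for name, pattern in signatures.items():
        offset = data.find(pattern)
        if offset != -1:
            hits.append((name, offset))
    return hits

# Made-up buffer containing two of the patterns.
blob = b"\x90\x90" + b"VMXh" + b"\x00" * 4 + b"\x0f\x3f\x07\x0b"
print(find_vm_tricks(blob))  # [('VirtualPc trick', 10), ('VMware trick', 2)]
```

Short patterns such as "VMXh" can also occur by chance in unrelated data, so a hit is better treated as a feature than as proof.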



Sample execution
Analyzing e665297bf9dbb2b2790e4d898d70c9e9

Analyzing registry...
[+] Malware is Adding a Key at Hive: HKEY_LOCAL_MACHINE
^G^@Label11^@^A^AÃˇ^Nreg add "HKEY_LOCAL_MACHINESOFTWAREMicrosoftWindows NTCurrentVersion
R
File Execution OptionsRx.exe" /v debugger /t REG_SZ /d %systemrot%repair1sass.exe /f^M

....

[+] Malware Seems to be IRC BOT: Verified By String : ADMIN
[+] Malware Seems to be IRC BOT: Verified By String : LIST
[+] Malware Seems to be IRC BOT: Verified By String : QUIT
[+] Malware Seems to be IRC BOT: Verified By String : VERSION
Analyzing interesting calls..
[+] Found an Interesting call to: FindWindow
[+] Found an Interesting call to: LoadLibraryA
[+] Found an Interesting call to: CreateProcess
[+] Found an Interesting call to: GetProcAddress
[+] Found an Interesting call to: CopyFile
[+] Found an Interesting call to: shdocvw



Advantages on the researcher’s side

Malware writers are usually "lazy", so they tend to
reuse chunks of earlier code
This makes it easier to trace a sample back to an earlier family based on
the commonalities



Our methods

Roughly, our methods consist of:

1 Feature Selection (Ranking/Pruning)
2 Supervised Classification
3 Unsupervised Classification

Items 2 and 3 above can also be combined into a method
known as "Semi-Supervised Classification".



Information Gain
[Zhang et al., 2007, Altaher et al., 2011,
Singhal and Raul, 2012] use the following formula to apply IG
to malware detection
The amount by which the entropy of X decreases
reflects the additional information about X provided by Y. It is
called information gain, given by

\[ IG(X \mid Y) = H(X) - H(X \mid Y) \]

[Singhal and Raul, 2012] introduced the following adjustment
to correct for error in the results:

\[ IG(X) = IG(X) \pm \frac{\sum_{i=0}^{n} IG(X_i)}{n} \]
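
The basic definition IG(X|Y) = H(X) − H(X|Y) can be illustrated with a small ranking example, where X is the class label and Y is the presence of one feature. This is a sketch on made-up data with assumed function names; it does not include the correction term from [Singhal and Raul, 2012].

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) of a list of labels, in bits."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(X|Y) = H(X) - H(X|Y), with X the labels and Y the feature values."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [x for x, y in zip(labels, feature_values) if y == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Toy data (invented): four samples, two binary features.
labels = ["malware", "malware", "benign", "benign"]
features = {
    "IsDebuggerPresent": [1, 1, 0, 0],  # separates the classes perfectly
    "GetProcAddress": [1, 1, 1, 1],     # present everywhere, so uninformative
}
ranking = sorted(features, key=lambda f: information_gain(labels, features[f]), reverse=True)
print(ranking)  # ['IsDebuggerPresent', 'GetProcAddress']
```

Features at the bottom of such a ranking are the candidates for pruning as noise.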


Information Gain (cont’d)

From [Jiang et al., 2011]:

\[ IG(t) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log \frac{P(t', c)}{P(t')\,P(c)} \]
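
This form agrees with the entropy-based definition IG(X|Y) = H(X) − H(X|Y). A short derivation, assuming C is the class variable over {c_i, \bar{c}_i} and T is the presence or absence of feature t (the symbols C, T and t' are introduced here for illustration):

```latex
\begin{align*}
IG(t) &= \sum_{c}\sum_{t'} P(t',c)\,\log\frac{P(t',c)}{P(t')\,P(c)} \\
      &= \sum_{c}\sum_{t'} P(t',c)\,\log\frac{P(c \mid t')}{P(c)} \\
      &= -\sum_{c} P(c)\,\log P(c) \;+\; \sum_{t'} P(t') \sum_{c} P(c \mid t')\,\log P(c \mid t') \\
      &= H(C) - H(C \mid T)
\end{align*}
```

The second line uses P(t',c) = P(c | t') P(t'), and the third splits the logarithm and sums out t' in the term that depends only on c.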



For research purposes, the following issues keep coming up:
No standard dataset exists, unlike in the Intrusion Detection System
(IDS) area
Malware samples change quickly, so will the datasets used
for the experiment be questioned?
As a last resort, stick to an existing database and try to stay independent of
any specific malware family, so that the method
can work with new, incoming malware



Table: Differences between clustering and classification


Classification                 Clustering
-----------------------------  -----------------------------
Deals with known data          Deals with unknown data
Supervised learning            Unsupervised learning
Popular algorithms include:    Popular algorithms include:
  Random Forest                  K-means
  Neural Networks                Fuzzy C-means
  k-Nearest Neighbor             Gaussian
  Decision Trees



Classification (supervised) is chosen to deal with a known
corpus with incomplete data
Clustering (unsupervised) is chosen to deal with new inputs
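
The difference between the two can be shown on toy binary feature vectors. The sketch below uses k-nearest neighbour for the labelled case and plain k-means for the unlabelled case, written in pure Python; the data, the Hamming distance, and seeding k-means with the first k points are all assumptions for illustration, not the experimental setup of this work.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Supervised: majority label among the k labelled vectors nearest to query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans(points, k=2, iterations=10):
    """Unsupervised: plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            dists = [sum((x - c) ** 2 for x, c in zip(p, centroid)) for centroid in centroids]
            assignment[i] = dists.index(min(dists))
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Invented labelled corpus: (API presence vector, label).
train = [
    ([1, 1, 0, 1], "malware"),
    ([1, 1, 1, 1], "malware"),
    ([0, 0, 1, 0], "benign"),
    ([0, 0, 0, 0], "benign"),
    ([0, 1, 0, 0], "benign"),
]
print(knn_predict(train, [1, 1, 0, 0]))  # malware

# Invented unlabelled inputs: only group membership is recovered.
unlabeled = [[1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(kmeans(unlabeled))  # [0, 1, 0, 1]
```

The classifier returns a label because labels were supplied; the clustering step only returns group indices, which an analyst still has to interpret.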



Some results

We managed to detect several malware samples by using
the existing API traces and other features (bot
commands, file/registry deletion)
Newer, more sophisticated malware such as
Stuxnet/Duqu is very platform specific (it attacks SCADA
systems), so more reading is needed on how to detect it.
Perhaps the most obvious indicator is the use of XOR'ed communication
channels, if any.
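
One simple way to look for XOR'ed content is to try every single-byte key and check whether known keywords appear. This is a sketch under the assumption of a single-byte key, which is only the simplest case; the keywords are two of the IRC commands seen in the sample output earlier, and the function names and test payload are made up.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

def find_xor_keys(data: bytes, keywords, keys=range(1, 256)):
    """Return {key: [keywords found]} for each key whose decoding reveals a keyword."""
    hits = {}
    for key in keys:
        decoded = xor_bytes(data, key)
        found = [w for w in keywords if w in decoded]
        if found:
            hits[key] = found
    return hits

# Made-up payload: IRC-style commands hidden with key 0x5A.
hidden = xor_bytes(b"VERSION\r\nQUIT :bye", 0x5A)
print(find_xor_keys(hidden, [b"QUIT", b"VERSION"]))  # {90: [b'QUIT', b'VERSION']}
```

Multi-byte or rolling keys need a different search, so a miss here does not rule out XOR encoding.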



The flow

1 Feature Selection and Feature Categorization (Weka, Octave/Matlab)
2 Clustering and Classification (Weka, scipy, Octave/Matlab)
3 Visualization (scipy, Octave/Matlab)



Altaher, A., Ramadass, S., and Ali, A. (2011).
Computer Virus Detection Using Features Ranking and Machine Learning.
Australian Journal of Basic and Applied Sciences, 5(9):1482--1486.

Jiang, Q., Zhao, X., and Huang, K. (2011).
A feature selection method for malware detection.
In 2011 IEEE International Conference on Information and Automation (ICIA), pages 890--895.

Pietrek, M. (1994).
Peering Inside the PE: A Tour of the Win32 Portable Executable File Format.
http://msdn.microsoft.com/en-us/library/ms809762.aspx.

Singhal, P. and Raul, N. (2012).
Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks.
International Journal of Network Security & Its Applications, 4.

Zhang, B., Yin, J., Hao, J., Wang, S., and Zhang, D. (2007).
New malicious code detection based on n-gram analysis and rough set theory.
pages 626--633. Springer-Verlag, Berlin, Heidelberg.

