In this paper, behavioral-based detection powered by machine learning is introduced. As the result, detection ratio is dramatically improved by comparison with traditional detection.
Needless to say that malware detection is getting harder today. Everybody knows signature-based detection reaches its limit, so that most anti-virus vendors use heuristic, behavioral and reputation-based detections altogether. About targeted attack, basically attackers use undetectable malware, so that reputation-based detection doesn't work well because it needs other victims beforehand. And it is a fact that detection ratio is not enough though we use heuristic and behavioral-based detections. In our research using the Metascan, average detection ratio of newest malware by most anti-virus scanner is about 30 %( the best is about 60 %).
By the way, heuristic and behavioral-based detections are developed by knowledge and experience of malware analyst. For example, most analysts know that following features are indicator that those programs are malicious.
- A file imports VirtualAlloc, VirtualProtect and LoadLibrary only and has a strange section name
- An entry point that does not fall within declared text or code section
- Creating remote threads into a legitimate process like explore.exe
- After unpacking, calling OpenMutex and CreateMutex to avoid multiple infections
- Register itself to auto start extension points like services and registry
- Creating a .bat file and try to delete own itself through executing the file with cmd.exe
- Setting global hook to capture keystroke using SetWindowsHookEx
Heuristic and behavioral-based detections are developed based on those pre-determined features like above. Analysts are finding those features day by day. But, this kind of work is not appropriate for human. Therefore we classified programs as malware or benign by machine learning through dynamic analysis results. Thereby, detection ratio is dramatically improved and we could recognize that which features are strongly related to malware by numeric score. And then, we could find the features which we’ve never found by this method. Finally, the outlook and challenges of this method will be tackled.