Generation of Static YARA-Signatures Using Genetic Algorithm
Alexander Zhdanov (Jdanoff)
Inria Rennes-Bretagne Atlantique
June 17, 2019
2. Generation of Static YARA-Signatures Using Genetic Algorithm Alexander Zhdanov June 17, 2019- 2
Abstract
Subject: malware detection using static YARA-signatures and a
Genetic Algorithm (GA).
The proposed solution consists of two algorithms:
- n-gram distributions with a Maximization-Maximization (MM)
  algorithm based on Multinomial Naive Bayes (MNB) analysis
- n-gram analysis with directed generation of YARA-rules based on
  the Genetic Algorithm (GA), an Artificial Intelligence (AI) technique
4. Iucundum est narrare sua mala
A problem shared is a problem halved
1 Introduction
Malware detection and generation of static YARA-signatures
5. Introduction
YARA-rules: principle
The YARA library and scanner are a de facto standard for
signature-based malware scanning of files.
The YARA rule format is an easy-to-understand DSL with a C-like
syntax.
6. Introduction
YARA-rules: syntax
rule silent_banker : banker
{
    meta:
        description = "This is just an example"
        threat_level = 3
        in_the_wild = true
    strings:
        // hex byte patterns
        $a = {6A 40 68 00 30 00 00 6A 14 8D 91}
        $b = {8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9}
        // plain text string
        $c = "UVODFRYSIHLNWPEJXQZAKCBGMT"
    condition:
        // the rule matches if any of the three strings is found
        $a or $b or $c
}
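Because the talk is about producing such rules automatically, it may help to see how a set of byte patterns can be turned into rule text. The following is a minimal sketch only: the function name `render_rule`, the `$s0, $s1, ...` identifiers and the sample bytes are illustrative assumptions, and the slides do not say how the author's tool serializes its rules.

```python
from typing import Sequence


def render_rule(name: str, ngrams: Sequence[bytes]) -> str:
    """Render byte patterns as hex strings of one YARA rule, joined by `or`.

    Hypothetical helper: it does not validate `name` as a YARA identifier.
    """
    if not ngrams:
        raise ValueError("a rule needs at least one string")
    lines = [f"rule {name}", "{", "    strings:"]
    for i, ng in enumerate(ngrams):
        hex_bytes = " ".join(f"{b:02X}" for b in ng)
        lines.append(f"        $s{i} = {{ {hex_bytes} }}")
    # Same condition shape as the slide: any single string is enough.
    condition = " or ".join(f"$s{i}" for i in range(len(ngrams)))
    lines += ["    condition:", f"        {condition}", "}"]
    return "\n".join(lines)


# Made-up patterns, for illustration only.
rule = render_rule("demo", [bytes.fromhex("6a40680030"), b"\x8d\x4d"])
print(rule)
```

The generated text uses only the `strings` and `condition` sections shown in the example rule above.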
7. Introduction
YARA-rules: pros and cons
pros:
- easy to read and understand
- fast classification via string (pattern) matching
- fast sharing and updating of YARA-rule databases (VirusTotal)
cons:
- static signatures are not robust against malware mutation, packing,
  and obfuscation
- YARA-rules are written manually (affecting performance and optimality)
- YARA can run slowly on big datasets ...
8. Introduction
YARA-rules: generators
- "yarGen" [4] and "YaraGenerator" [3]: a common-strings approach
- "yabin" [2]: clustering based on code reuse and discovery of rare
  function signatures
9. Background
Bene diagnoscitur, bene curatur
A disease known is half cured
2 Background
N-grams and GA
10. Background N-grams
N-gram frequency
Byte n-grams are the overlapping substrings of length n of a program P.
For a given n-gram Ng, the n-gram frequency τ(Ng, P) is the number of
occurrences of Ng in P, where τ(Ng, P) = 0 ⇔ Ng ∉ P and τ(Ng, P) ∈ Z∗.
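The definition can be illustrated with a short Python sketch that counts overlapping byte n-grams using a sliding window with a step of one byte. The function name `ngram_frequencies` and the sample bytes are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter


def ngram_frequencies(program: bytes, n: int = 5) -> Counter:
    """Count overlapping byte n-grams of `program` (sliding window, step 1)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A program shorter than n contains no n-grams, so the range is empty.
    return Counter(program[i:i + n] for i in range(len(program) - n + 1))


# Hypothetical 7-byte "program"; real inputs would be whole binaries.
sample = bytes.fromhex("6a40686a40686a")
tau = ngram_frequencies(sample, n=3)
print(tau[bytes.fromhex("6a4068")])  # frequency of one 3-gram
print(tau[b"\x00\x00\x00"])          # an absent n-gram has frequency 0
```

A `Counter` returns 0 for missing keys, which mirrors the condition τ(Ng, P) = 0 for n-grams that do not occur in P.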
11. Background N-grams
Logarithmic Likelihood function
For known program classes C_k ∈ C with distributions of
class-normalized and total-normalized n-gram frequencies
ψ(Ng, C_k) and ψ(Ng) respectively, the log-likelihood LL(Ng_{C_x}, C_k)
that the analyzed n-gram set Ng_{C_x} belongs to class C_k is defined as:

LL(Ng_{C_x}, C_k) = \sum_{Ng \in Ng_{C_x}} \tau(Ng, C_x) \log(\psi(Ng, C_k)) - \sum_{Ng \in Ng_{C_x}} \log(\psi(Ng))    (1)
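A hedged Python sketch of eq. (1) follows. The frequency tables are passed in as plain dictionaries, and the `floor` value used for n-grams missing from a table is an assumption added here to avoid log(0); the slides do not describe how unseen n-grams are handled. All numbers are toy values.

```python
import math
from typing import Mapping


def log_likelihood(tau_x: Mapping[bytes, int],
                   psi_class: Mapping[bytes, float],
                   psi_total: Mapping[bytes, float],
                   floor: float = 1e-9) -> float:
    """Evaluate eq. (1) for one candidate class C_k.

    tau_x     -- n-gram counts tau(Ng, C_x) of the analyzed sample
    psi_class -- class-normalized frequencies psi(Ng, C_k)
    psi_total -- total-normalized frequencies psi(Ng)
    floor     -- assumed smoothing value for n-grams missing from a table
    """
    first = sum(count * math.log(psi_class.get(ng, floor))
                for ng, count in tau_x.items())
    second = sum(math.log(psi_total.get(ng, floor)) for ng in tau_x)
    return first - second


# Toy numbers, chosen only to make the arithmetic easy to follow.
tau_x = {b"ab": 2, b"bc": 1}
psi_class = {b"ab": 0.5, b"bc": 0.25}
psi_total = {b"ab": 0.5, b"bc": 0.5}
ll = log_likelihood(tau_x, psi_class, psi_total)
print(ll)
```

Here the first sum is 2·log 0.5 + log 0.25 and the second is 2·log 0.5, so the result equals log 0.25.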
12. Background GA
GA: generation (population)
A generation (population) π_k = (X^k_1, ..., X^k_{|π_k|}) of size
|π_k| ∈ N_1 is defined as a finite set of individuals X^k_l ∈ A^{|X|},
l ∈ {1, ..., |π_k|}, indexed by a natural number k ∈ N_0.
13. Background GA
GA: fitness function
A fitness function f is a function f : A^{|X|} → [0, 1], where A^{|X|} is
the pool of individuals of size |X| ∈ N.
14. Algorithms
Acquirit qui tuetur
Sparing is the first gaining
3 Algorithms
MNB, MM and GA
15. Algorithms
MM & MNB: MM algorithm
The objective function LL(Ng_{M_x}, M_i) defined in eq. (1) is maximized
in two steps:
- maximization of LL(Ng_{M_x}, M_i) with respect to the set of
  n-grams Ng_{M_x} ∈ M_x
- maximization of LL(Ng_{M_x}, M_i) with respect to the malware
  class M_i
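The two steps can be sketched in Python under one possible reading of the slide, which is an assumption rather than the author's stated procedure: since eq. (1) is a sum of independent per-n-gram terms, step 1 keeps, for a fixed class, only the n-grams whose term is positive, and step 2 picks the class with the largest resulting score. The function names, the `floor` smoothing value and the toy frequencies are all hypothetical.

```python
import math
from typing import Mapping, Set, Tuple

Freq = Mapping[bytes, float]


def contribution(ng: bytes, count: int, psi_class: Freq, psi_total: Freq,
                 floor: float = 1e-9) -> float:
    """Per-n-gram term of eq. (1): tau * log(psi(Ng, M_i)) - log(psi(Ng))."""
    return (count * math.log(psi_class.get(ng, floor))
            - math.log(psi_total.get(ng, floor)))


def mm_classify(tau_x: Mapping[bytes, int],
                class_models: Mapping[str, Freq],
                psi_total: Freq) -> Tuple[str, Set[bytes], float]:
    """Two nested maximizations of LL under the assumptions stated above."""
    best: Tuple[str, Set[bytes], float] = ("", set(), -math.inf)
    for name, psi_class in class_models.items():
        # Step 1: for a fixed class, keep the n-grams that raise LL.
        # The sum is separable, so dropping non-positive terms maximizes it.
        terms = {ng: contribution(ng, c, psi_class, psi_total)
                 for ng, c in tau_x.items()}
        kept = {ng for ng, t in terms.items() if t > 0}
        score = sum(terms[ng] for ng in kept)
        # Step 2: keep the class with the highest maximized LL.
        if score > best[2]:
            best = (name, kept, score)
    return best


# Toy frequencies; only the family names come from the dataset slide.
models = {"ramnit": {b"ab": 0.5, b"cd": 0.01},
          "upatre": {b"ab": 0.05, b"cd": 0.02}}
totals = {b"ab": 0.1, b"cd": 0.1}
family, kept, score = mm_classify({b"ab": 1, b"cd": 1}, models, totals)
print(family, kept, score)
```

In this toy input only the n-gram `ab` under the `ramnit` model has a positive term, so that class wins with a score of log 5.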
16. Algorithms
GA: optimization task
The optimization target for YARA-rule generation:

\text{maximize } f(X) \quad \text{subject to } X \in A^{|X|}    (2)

where f(X) is the fitness value for the rule X according to the
fitness function f : A^{|X|} → [0, 1] defined previously.
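To make the optimization task concrete, here is a generic textbook-style genetic algorithm in Python that searches A^{|X|} for an individual with high fitness. It is a sketch under stated assumptions, not the training cycle used in the talk: binary tournament selection, one-point crossover, per-gene mutation, elitism, and all parameter defaults are choices made here for illustration, and the toy fitness function stands in for the real rule evaluation.

```python
import random
from typing import Callable, List, Sequence, Tuple

Individual = Tuple[int, ...]


def evolve(fitness: Callable[[Individual], float],
           alphabet: Sequence[int],
           length: int,
           pop_size: int = 20,
           generations: int = 50,
           mutation_rate: float = 0.1,
           seed: int = 0) -> Individual:
    """Generic GA; requires length >= 2 and pop_size >= 2."""
    rng = random.Random(seed)
    population: List[Individual] = [
        tuple(rng.choice(alphabet) for _ in range(length))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        def select() -> Individual:
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b

        # Elitism: the best individual survives unchanged.
        next_gen: List[Individual] = [max(population, key=fitness)]
        while len(next_gen) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, length)       # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            for i in range(length):              # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.choice(alphabet)
            next_gen.append(tuple(child))
        population = next_gen
    return max(population, key=fitness)


def toy_fitness(ind: Individual) -> float:
    """Stand-in fitness in [0, 1]: the fraction of genes equal to 1."""
    return sum(ind) / len(ind)


best = evolve(toy_fitness, alphabet=(0, 1), length=8)
print(best, toy_fitness(best))
```

Because the best individual is always carried over, the best fitness in the population never decreases from one generation to the next.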
17. Algorithms
GA: A fitness function
The fitness function is defined as follows:

f(X) = F_1(X) + \gamma \cdot p_f(X)    (3)

where f(X) is the fitness value, F_1(X) is the detection rate, p_f(X) is a
frequency estimate for the individual X ∈ A^{|X|}, and γ ∈ [0, 1] is a
weight.
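A minimal Python sketch of eq. (3) is given below. It assumes that F1 is the usual harmonic mean of precision and recall computed from true positives, false positives and false negatives, and it takes the frequency estimate p_f as a ready-made input because the slides do not define how it is computed. The slides also do not say how the sum is kept inside [0, 1], so the sketch returns the raw value.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing is detected."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fitness(tp: int, fp: int, fn: int, p_f: float, gamma: float = 0.5) -> float:
    """Eq. (3): f(X) = F1(X) + gamma * p_f(X), with gamma in [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return f1_score(tp, fp, fn) + gamma * p_f


# Hypothetical counts: 8 detected malware files, 2 false alarms, 2 misses.
value = fitness(tp=8, fp=2, fn=2, p_f=0.1, gamma=0.5)
print(value)
```

With these counts precision and recall are both 0.8, so F1 is 0.8 and the fitness is 0.8 + 0.5 · 0.1 = 0.85.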
18. Algorithms
GA: A training cycle
19. Experimental results
Eodem cubito, eadem trutina, pari libra
By the same cubit, the same scales, an equal balance
4 Experimental results
5 malware families
20. Experimental results Setup
Datasets
cleanware
- small dataset
- 2,613 files: executables, PDF documents, web pages, MP3, MP4, ...
- 1.3 GB
malware
- 11 malware families: bladabindi, convertad, downloadadmin, icloader,
  loadmoney, multiplug, parite, ramnit, softcnapp, upatre, vtflooder
- 8,329 binaries in total, 8.5 GB
21. Experimental results Setup
Model extraction
algorithm: sliding window
n-gram size: 5 bytes [1]
cleanware
- size: 5.9 GB
malware
- the model is built for 5 malware families: downloadadmin,
  loadmoney, ramnit, vtflooder, convertad
- 7 GB in total
22. Experimental results Setup
Data from cleanware model
23. Experimental results Results
MM algorithm, heuristic #1, results
Table 1 shows the precision, recall and detection (F1) rates for the MM
algorithm, heuristic #1.

Table 1: Precision-Recall-Detection rates for the Maximization-Maximization
(MM) algorithm, heuristic #1

                 Maximization-Maximization algorithm, heuristic #1
malware family   bin.                        bin. rej.
                 prec./rec. (%)   F1 (%)     prec./rec. (%)   F1 (%)
downloadadmin    99/93            95         96/93            95
loadmoney        85/6             11         62/6             10
ramnit           38/82            51         9/82             16
vtflooder        99/90            94         69/90            78
convertad        58/85            68         18/85            29
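
The F1 column is consistent with the usual harmonic mean of precision P and recall R, which is assumed here because the slides do not spell the formula out. A worked check for loadmoney in the bin. column (P = 85, R = 6):

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 85 \cdot 6}{85 + 6} = \frac{1020}{91} \approx 11.2\%
```

This agrees with the reported 11%: the very low recall dominates the score even though precision is high. Small differences in other rows are to be expected, since the published precision and recall values are themselves rounded.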
24. Experimental results Results
GA results
Table 2 presents the experimental results for the GA.

Table 2: Precision-Recall-Detection rates for the Genetic Algorithm (GA)

                 Machine Learning GA
malware family   bin.                        bin. rej.
                 prec./rec. (%)   F1 (%)     prec./rec. (%)   F1 (%)
downloadadmin    99/99            99         99/99            99
loadmoney        99/89            93         98/89            93
ramnit           68/65            66         19/65            29
vtflooder        100/99           99         95/99            96
convertad        97/99            97         44/99            60
25. Experimental results Results
MM, heuristic #1, convergence
26. Experimental results Results
GA, convergence
27. Discussion & Conclusions
Docendo discimus
We learn by teaching
5 Discussion & Conclusions
28. Discussion & Conclusions
Experimental results
Experimental results show (detection rate):
- Overall: F_1(X_{ga}) ≥ F_1(X_{mm})
- GA: lim_{tc→∞} F_1(X^{tc}_{ga}) = 100%
- MM and MNB: lim_{tc→∞} F_1(X^{tc}_{mm}) = 50%
29. Discussion & Conclusions
Experimental results
Experimental results show (space complexity):
- GA: O(M_i)
- MM and MNB: O(Σ_{j=1}^{n} M_j + C)
- Overall: O(M_i) < O(Σ_{j=1}^{n} M_j + C)
30. Discussion & Conclusions
Experimental results
Experimental results show (time complexity):
- GA: O(|π_k| · |X| · (Σ_{i=1}^{n} |M_i| + |C|)), where |π_k| is the
  population size, |X| is the YARA-rule size and |M_i| is the size of
  the malware family M_i in the dataset (≈ 9710 sec).
31. Discussion & Conclusions
Conclusions & Future Work
- Compare the GA detection performance with other AI/ML methods
  (bio-inspired).
- Test the GA detection quality on compressed and encrypted files and
  on polymorphic and metamorphic viruses.
- Find a solution to the problem of GA training performance.
32. Discussion & Conclusions
Latet enim veritas, sed nihil pretiosius veritate
Truth is hidden, but nothing is more precious than the truth
7 Literature
33. Discussion & Conclusions
Literature
AV-TEST. The Independent IT-Security Institute. https://www.av-test.org/. Accessed: 2019-03-30.
yabin. https://github.com/AlienVault-OTX/yabin. Accessed: 2019-03-30.
YaraGenerator. https://github.com/Xen0ph0n/YaraGenerator. Accessed: 2019-03-30.
yarGen. https://github.com/Neo23x0/yarGen. Accessed: 2019-03-30.
34. Discussion & Conclusions
Literature
James Mayfield and Paul McNamee. Single n-gram stemming. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 415-416, New York, NY, USA, 2003. ACM.
Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, Berlin, Heidelberg, 1999.
Michael Sikorski and Andrew Honig. Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software. No Starch Press, San Francisco, CA, USA, 1st edition, 2012.
35. Discussion & Conclusions
Literature
M. Srinivas and L. M. Patnaik. Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 24(4):656-667, Apr 1994.
Clayton L. Bridges and David E. Goldberg. An analysis of reproduction and crossover in a binary-coded genetic algorithm. In Proceedings of the Second International Conference on Genetic Algorithms and Their Application, pages 9-13, Hillsdale, NJ, USA, 1987. L. Erlbaum Associates Inc.
Nuwan I. Senaratna. Genetic algorithms: The crossover-mutation debate. 2005.
36. Discussion & Conclusions
Literature
Kent Griffin, Scott Schneider, Xin Hu, and Tzi-cker Chiueh. Automatic generation of string signatures for malware detection. In Engin Kirda, Somesh Jha, and Davide Balzarotti, editors, Recent Advances in Intrusion Detection, pages 101-120, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
Keehyung Kim and Byung-Ro Moon. Malware detection based on dependency graph using hybrid genetic algorithm. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, GECCO '10, pages 1211-1218, New York, NY, USA, 2010. ACM.
J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 470-478, 2004.
37. Discussion & Conclusions
Literature
Nitin Naik, Paul Jenkins, Nick Savage, and Longzhi Yang. Cyberthreat Hunting - Part 1: Triaging Ransomware using Fuzzy Hashing, Import Hashing and YARA Rules. In 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 23-26 June 2019, New Orleans. (In press)
D. Krishna Sandeep Reddy and Arun K. Pujari. N-gram analysis for computer virus detection. Journal in Computer Virology, 2(3):231-239, Dec 2006.
Amount of monetary damage caused by reported cyber crime to the IC3 from 2001 to 2017 (in million U.S. dollars). In Statista - The Statistics Portal. Retrieved May 29, 2019, https://www.statista.com/statistics/267132/total-damage-caused-by-by-cyber-crime-in-the-us/
39. Discussion & Conclusions
Thanks
Thank you, and thanks also to:
Prof. Dr. Axel Legay
Dr. Fabrizio Biondi
Dr. Olivier Zendra
Dr. Sophie Pinchinat
Dr. Michel Hurfin
Jean Quilbeuf
TAMIS team