1.
Examining Malware with Python
Phil Roth
Data Scientist at Endgame
@mrphilroth
2.
Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware
16.
My Solution
Features -> Feature Selection -> Model
Byte ngrams -> SelectKBest -> Gradient Boosting Classifier
Instruction ngrams + Named features -> SelectKBest -> Gradient Boosting Classifier
Manual features -> Gradient Boosting Classifier
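The features, selection, and model stages on this slide map onto standard scikit-learn text tools. The sketch below is only an illustration under assumptions: the toy mnemonic strings, the labels, the `chi2` scoring function, and every parameter value are hypothetical and are not taken from the talk.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Hypothetical toy data: each "document" is a space-separated mnemonic string.
docs = [
    "push mov sub call pop ret",
    "push mov call call pop ret",
    "push mov sub call ret",
    "xor xor jmp xor jmp nop",
    "xor jmp jmp xor nop nop",
    "xor xor jmp nop jmp xor",
]
labels = [0, 0, 0, 1, 1, 1]

model = Pipeline([
    # Count instruction unigrams and bigrams, as a text vectorizer would.
    ("ngrams", CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 2))),
    # Keep the k ngrams that score highest against the labels (chi2 is assumed).
    ("select", SelectKBest(chi2, k=5)),
    # Fit a gradient boosting classifier on the selected counts.
    ("gbc", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])
model.fit(docs, labels)
predictions = model.predict(["push mov call ret", "xor jmp xor jmp nop"])
```

Byte ngrams could go through the same vectorizer by treating each hex byte as a token.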
24.
Named Features
Section names, imports, imported functions.
Extracted these features with regular expressions.
Features were (awkwardly) selected in the same step as instruction ngrams.
25.
Named Features
import re

# Each entry pairs a regex with a function that extracts the feature name
# from a match and a predicate applied to the extracted name.
re_features = {
    "imports" : {
        "re" : re.compile(r"Imports from \w.+"),
        "extract" : lambda m : m.group().split()[-1],
        "filter" : lambda m : True
    },
    "imported_functions" : {
        "re" : re.compile(r"__stdcall \w.+\("),
        "extract" : lambda m : m.group().split()[-1][:-1],
        # Skip auto-generated names such as sub_401000.
        "filter" : lambda m : not m.startswith("sub_")
    },
    "section_names" : {
        "re" : re.compile(r"^\S+?:"),
        "extract" : lambda m : m.group()[:-1],
        "filter" : lambda m : True
    }
}
26.
Named Features
from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict) :
    # Search each line, keep the matches, extract names, filter, deduplicate.
    return pipe(
        lines,
        map(re_dict["re"].search),
        filter(lambda m : m is not None),
        map(re_dict["extract"]),
        filter(re_dict["filter"]),
        unique
    )
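A dependency-free sketch of the same search, extract, filter, deduplicate flow, limited to the imported-functions pattern. The sample lines are hypothetical stand-ins for IDA-style disassembly text, and the helper name `imported_functions` is introduced only for this illustration.

```python
import re

# Pattern for stdcall prototypes, with the backslashes it needs to compile.
IMPORTED_FUNCTION = re.compile(r"__stdcall \w.+\(")

def imported_functions(lines):
    """Return unique imported function names in first-seen order."""
    seen = []
    for line in lines:
        match = IMPORTED_FUNCTION.search(line)
        if match is None:
            continue
        # Last whitespace-separated token of the match, minus the trailing "(".
        name = match.group().split()[-1][:-1]
        # Drop auto-generated names and repeats.
        if name.startswith("sub_") or name in seen:
            continue
        seen.append(name)
    return seen

# Hypothetical input lines, not taken from a real sample.
sample_lines = [
    "; BOOL __stdcall CloseHandle(HANDLE hObject)",
    "; int __stdcall sub_401000(int a1)",
    "; BOOL __stdcall CloseHandle(HANDLE hObject)",
    "mov eax, ebx",
]
names = imported_functions(sample_lines)
```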
31.
Final Model tSNE Plot
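from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline

# Reduce to 50 dimensions with truncated SVD, then embed in 2-D with t-SNE.
# Recent scikit-learn releases rename n_iter to max_iter.
pipe = Pipeline([
    ("tsvd", TruncatedSVD(n_components=50)),
    ("tsne", TSNE(n_components=2, perplexity=40.0,
                  early_exaggeration=4.0, learning_rate=1000.0,
                  n_iter=1000, metric='euclidean', init='random'))
])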
32.
Results:
I did OK…
More focused on productization
33.
Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi-supervised learning
usable in a product
specific to competitions
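One way "compression ratio as a feature" could be computed, as a minimal sketch. The choice of zlib and the `compression_ratio` helper are assumptions for illustration; the slide does not say which compressor the winning entries used.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (0.0 for empty input)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)

# Highly repetitive bytes compress well, so the ratio is small.
padding_ratio = compression_ratio(b"\x90" * 4096)

# Random bytes barely compress, so the ratio stays close to 1.
random_ratio = compression_ratio(os.urandom(4096))
```

Packed or encrypted sections tend to look like the random case, which is why the ratio can carry signal.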
34.
Using Capstone
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Drop the first token of each line, strip "?" placeholders, decode the hex.
code = bytes(bytearray.fromhex("".join(map(
    lambda l : "".join(l.split()[1:]).replace("?", ""),
    open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
))))

# Disassemble as 32-bit x86 and keep every mnemonic except int3.
md = Cs(CS_ARCH_X86, CS_MODE_32)
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
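The hex-dump parsing step can also be written as a small standalone function, which makes the assumed input format explicit. This is a sketch: the function name `parse_bytes_dump` and the sample lines are hypothetical, and it assumes each line is a leading token followed by two-digit hex bytes, with `??` marking unknown bytes.

```python
def parse_bytes_dump(lines):
    """Turn hex-dump lines into raw bytes, skipping unknown "??" entries."""
    hex_tokens = []
    for line in lines:
        # The first token is not part of the byte data, so drop it.
        for token in line.split()[1:]:
            if "?" not in token:
                hex_tokens.append(token)
    return bytes.fromhex("".join(hex_tokens))

# Hypothetical dump lines for illustration only.
sample_dump = [
    "00401000 55 8B EC ?? ??",
    "00401010 C3",
]
code = parse_bytes_dump(sample_dump)
```

The resulting `bytes` object is what a disassembler call such as `md.disasm_lite(code, 0x1000)` would take as input.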
35.
Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python scriptable reversing framework
vivisect: pure Python, largely undocumented disassembler and analysis project
36.
Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
37.
Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware