Examining Malware with Python
Phil Roth
Data Scientist at Endgame
@mrphilroth
Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware
Yes it’s malware, but what kind?
The Data
10868 labeled samples
10873 unlabeled samples
~500 GB uncompressed
9 classes
Classes
Hex Dump
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
raw data in hex
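Each line of the dump pairs an address with sixteen byte values. A minimal sketch of splitting one such line, assuming whitespace-separated fields as shown above; the helper name parse_hex_line is hypothetical and does not come from the talk:

```python
def parse_hex_line(line):
    """Split one hex-dump line into (address, list of byte values).

    Assumes the first field is a hex address and the remaining fields are
    two-character hex bytes; unreadable bytes written as "??" are skipped.
    """
    fields = line.split()
    address = int(fields[0], 16)
    data = [int(b, 16) for b in fields[1:] if b != "??"]
    return address, data


address, data = parse_hex_line(
    "00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11"
)
print(hex(address), len(data), data[:3])
```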
Hex Dump
Each line starts with an address (for example 00401180), followed by the raw data in hex (for example EC 01 2A 10 2A 01 AE).
Disassembly
HEADER:00400000 ;
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
HEADER:00400000 ; | License info: |
HEADER:00400000 ; | Microsoft |
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ;
HEADER:00400000
HEADER:00400000
HEADER:00400000 .686p
HEADER:00400000 .mmx
HEADER:00400000 .model flat
HEADER:00400000
HEADER:00400000 ; ===========================================================================
HEADER:00400000
HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
.text:00401000 ;
.text:00401000 ; Format : Portable executable for 80386 (PE)
.text:00401000 ; Imagebase : 400000
.text:00401000 ; Section 1. (virtual address 00001000)
.text:00401000 ; Virtual size : 00071050 ( 462928.)
.text:00401000 ; Section size in file : 00071200 ( 463360.)
.text:00401000 ; Offset to raw data for section: 00000400
.text:00401000 ; Flags 60000020: Text Executable Readable
.text:00401000 ; Alignment : default
.text:00401000 ; ===========================================================================
Disassembly
Each line is prefixed with its section name and address, for example HEADER:00400000.
Disassembly
.text:00470050 ; =============== S U B R O U T I N E ====================================
.text:00470050
.text:00470050 ; Attributes: bp-based frame
.text:00470050
.text:00470050 sub_470050 proc near ; CODE XREF: start+D8D^Yp
.text:00470050
.text:00470050 var_68 = dword ptr -68h
.text:00470050 var_64 = dword ptr -64h
.text:00470050 var_60 = dword ptr -60h
.text:00470050
.text:00470050 55 push ebp
.text:00470051 8B EC mov ebp, esp
.text:00470053 83 C4 98 add esp, 0FFFFFF98h
.text:00470056 33 C0 xor eax, eax
.text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
.text:0047005E 89 55 EC mov [ebp+var_14], edx
.text:00470061 89 45 EC mov [ebp+var_14], eax
.text:00470064 53 push ebx
.text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
.text:0047006B 83 FB 2D cmp ebx, 2Dh
.text:0047006E 75 03 jnz short loc_470073
.text:00470070 89 5D EC mov [ebp+var_14], ebx
.text:00470073
.text:00470073 loc_470073: ; CODE XREF: sub_470050+1E^Xj
.text:00470073 56 push esi
.text:00470074 33 C0 xor eax, eax
.text:00470076 8B 5D EC mov ebx, [ebp+var_14]
Disassembly
mov ebx,dword_4B107C
Disassembly
.idata:0046F4DC ;
.idata:0046F4DC ; Imports from KERNEL32.DLL
.idata:0046F4DC ;
.idata:0046F4DC ; ===========================================================================
.idata:0046F4DC
.idata:0046F4DC ; Segment type: Externs
.idata:0046F4DC ; _idata
.idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
.idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
.idata:0046F4DC ; DATA XREF: .text:0046F66C^Yo
.idata:0046F4DC ; GetCurrentThreadId^Yr
.idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
.idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C^Yr
.idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
.idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
.idata:0046F4E4 ; DATA XREF: .text:00471E46^Yr
.idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
.idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
.idata:0046F4E8 ; DATA XREF: VirtualAlloc^Yr
.idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
.idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
.idata:0046F4EC ; DATA XREF: .text:00471E70^Yr
Disassembly
Imports from KERNEL32.DLL
__stdcall VirtualAlloc(
My Solution
Features: byte ngrams, instruction ngrams, named features, manual features
Feature Selection: SelectKBest (byte ngrams), SelectKBest (instruction ngrams and named features)
Model: Gradient Boosting Classifier
Byte ngrams
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
Possibilities
1gram: 256
2gram: 65536
3gram: 16777216
4gram: 4294967296
Solution: Hashing
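The counts above are 256**n for n-byte sequences, which is why a fixed-size hashed feature space helps. A minimal sketch of the hashing idea using only the standard library; zlib.crc32 stands in for the hash function here as an assumption (scikit-learn's HashingVectorizer uses MurmurHash3), and hashed_ngram_counts is a hypothetical helper:

```python
import zlib
from collections import Counter


def hashed_ngram_counts(tokens, n_max=3, n_features=2**16):
    """Count word ngrams of length 1..n_max in a fixed number of buckets.

    Each ngram is hashed to a bucket index, so memory is bounded by
    n_features no matter how many distinct ngrams exist. Different
    ngrams may collide in the same bucket.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            bucket = zlib.crc32(ngram.encode("ascii")) % n_features
            counts[bucket] += 1
    return counts


# The number of possible byte ngrams grows as 256**n.
for n in range(1, 5):
    print(f"{n}gram: {256 ** n}")

tokens = "00 00 80 40 40 28 00 1C".split()
counts = hashed_ngram_counts(tokens)
print(sum(counts.values()))  # total ngrams counted: 8 + 7 + 6
```

Collisions are the price of the fixed width: a larger n_features lowers the collision rate at the cost of a wider matrix.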
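Byte ngrams
Code for extracting the byte ngrams and reducing dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash byte 1- to 3-grams into 2**16 columns.
# non_negative was removed in scikit-learn 0.21; alternate_sign=False is the
# closest replacement.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 3),
    analyzer="word", n_features=2**16, binary=False, norm=None,
    non_negative=True
)
# CustomExtractor is defined on the next slide.
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))
])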
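The TfidfTransformer step in this pipeline rescales raw ngram counts before feature selection. A rough sketch of that weighting in plain Python, assuming scikit-learn's documented formulas for sublinear_tf (1 + ln tf), smooth_idf (ln((1 + n) / (1 + df)) + 1), and l2 row normalization; tfidf_rows is a hypothetical helper, not the library's implementation:

```python
import math


def tfidf_rows(count_rows):
    """Apply sublinear tf, smoothed idf, and l2 row normalization."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in count_rows:
        weighted = [
            (1 + math.log(c)) * idf[j] if c > 0 else 0.0
            for j, c in enumerate(row)
        ]
        norm = math.sqrt(sum(w * w for w in weighted))
        out.append([w / norm if norm else 0.0 for w in weighted])
    return out


rows = tfidf_rows([[3, 0, 1], [2, 1, 0]])
for row in rows:
    print([round(w, 3) for w in row])
```

The logarithm damps the effect of very frequent byte sequences, and the idf factor raises the weight of ngrams that appear in few samples.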
Byte ngrams
import multiprocessing

import scipy.sparse
import toolz
from sklearn.feature_extraction.text import HashingVectorizer
from toolz.curried import filter, map

class CustomExtractor:
    def __init__(self, vectorizer=HashingVectorizer()):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y=None):
        # Extract features from the files in parallel, 32 files per task.
        with multiprocessing.Pool() as pool:
            rows = pool.map(self.feature_extract, X, 32)
        return scipy.sparse.vstack(list(rows))

    fit_transform = transform

    def feature_extract(self, file_name):
        # Drop the address column and unknown bytes, then join the rest
        # into one space-separated string for the vectorizer.
        clean_bytes = " ".join(toolz.pipe(
            open(file_name, "r"),
            map(lambda line: line.rstrip().split()[1:]),
            toolz.concat,
            filter(lambda b: b != "??" and b != "?")
        ))
        return self.vectorizer.transform([clean_bytes])
Byte ngrams
Why they might be useful: https://github.com/wapiflapi/binglide
Byte ngrams
sample 0A32eTdBKayjCWhZqDOQ
Instruction ngrams
Extracted instructions:
push lea push mov call mov mov pop retn
mov jmp
push mov mov call test jz push call add mov pop retn
mov mov mov mov retn
mov lea mov inc test jnz sub retn
mov mov mov push mov push push push push call add mov pop retn
mov mov mov push mov push push push push call add mov pop retn
xor retn
mov retn
mov retn
mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call push call mov push push push mov call mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call mov push push push mov call push call mov retn
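The talk does not show how the mnemonics were pulled out of the IDA listing. One possible sketch, assuming each instruction line has the form section:address, uppercase hex bytes, then a lowercase mnemonic, as in the disassembly slides; INSTRUCTION_RE and extract_mnemonics are hypothetical names:

```python
import re

# Section:address, one or more uppercase hex bytes, then a lowercase mnemonic.
INSTRUCTION_RE = re.compile(
    r"^\.text:[0-9A-F]{8}(?:\s+[0-9A-F]{2})+\s+([a-z][a-z0-9]*)\b"
)


def extract_mnemonics(lines):
    """Return the mnemonic of every line that looks like an instruction."""
    mnemonics = []
    for line in lines:
        match = INSTRUCTION_RE.match(line)
        if match:
            mnemonics.append(match.group(1))
    return mnemonics


listing = [
    ".text:00470050 sub_470050 proc near ; CODE XREF: start+D8D",
    ".text:00470050 55 push ebp",
    ".text:00470051 8B EC mov ebp, esp",
    ".text:00470053 83 C4 98 add esp, 0FFFFFF98h",
    ".text:00470056 33 C0 xor eax, eax",
    ".text:00470073 loc_470073: ; CODE XREF: sub_470050+1E",
]
print(" ".join(extract_mnemonics(listing)))
```

Labels, comments, and procedure headers carry no byte column, so the pattern skips them.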
Instruction ngrams
Code for extracting the instruction ngrams and reducing dimensionality:
# Same pipeline as for byte ngrams, with ngram_range=(1, 2) and n_features=2**25.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 2),
    analyzer="word", n_features=2**25, binary=False, norm=None,
    non_negative=True
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))
])
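With ngram_range=(1, 2) and analyzer="word", each mnemonic and each adjacent pair of mnemonics becomes a feature. A small sketch of what that tokenization yields, written in plain Python rather than with the vectorizer itself; word_ngrams is a hypothetical helper:

```python
from collections import Counter


def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all runs of min_n..max_n consecutive tokens, space-joined."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# One function's mnemonics, taken from the extracted instructions slide.
function = "push mov mov call test jz push call add mov pop retn".split()
counts = Counter(word_ngrams(function))
print(counts["mov"], counts["push call"], counts["mov mov"])
```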
Named Features
Section names, imports, and imported functions.
These features were extracted with regular expressions.
They were (awkwardly) selected in the same step as the instruction ngrams.
Named Features
import re

# Each entry holds a pattern, a function that extracts the name from a match,
# and a filter applied to the extracted name.
re_features = {
    "imports": {
        "re": re.compile(r"Imports from \w.+"),
        "extract": lambda m: m.group().split()[-1],
        "filter": lambda m: True
    },
    "imported_functions": {
        "re": re.compile(r"__stdcall \w.+\("),
        "extract": lambda m: m.group().split()[-1][:-1],
        "filter": lambda m: not m.startswith("sub_")
    },
    "section_names": {
        "re": re.compile(r"^\S+?:"),
        "extract": lambda m: m.group()[:-1],
        "filter": lambda m: True
    }
}
Named Features
from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict):
    # Search each line, keep the hits, extract and filter the names,
    # and drop repeats.
    return pipe(
        lines,
        map(re_dict["re"].search),
        filter(lambda m: m is not None),
        map(re_dict["extract"]),
        filter(re_dict["filter"]),
        unique
    )
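To see what these patterns pick up, here is a standard-library sketch applied to a few lines from the disassembly slides. It assumes the patterns are meant to use \w, \S, and an escaped parenthesis; unique_matches is a hypothetical stand-in for the toolz pipeline, not the author's code:

```python
import re

IMPORT_RE = re.compile(r"Imports from \w.+")
FUNCTION_RE = re.compile(r"__stdcall \w.+\(")
SECTION_RE = re.compile(r"^\S+?:")

lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]


def unique_matches(pattern, extract, lines):
    """Search every line, extract a name from each hit, keep first-seen order."""
    seen = []
    for line in lines:
        match = pattern.search(line)
        if match:
            name = extract(match)
            if name not in seen:
                seen.append(name)
    return seen


imports = unique_matches(IMPORT_RE, lambda m: m.group().split()[-1], lines)
functions = unique_matches(
    FUNCTION_RE, lambda m: m.group().split()[-1][:-1], lines
)
sections = unique_matches(SECTION_RE, lambda m: m.group()[:-1], lines)
print(imports, functions, sections)
```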
Manual Features
Sample 0A32eTdBKayjCWhZqDOQ:
{
    "number_of_collapsed_functions": 451,
    "number_of_imported_functions": 101,
    "sample_length": 1201668,
    "number_of_imports": 4,
    "number_of_sections": 4,
    "section_length_0": 979764,
    ...
    "section_length_6": 0,
    "length_of_functions_0": 2706,
    ...
    "length_of_functions_15": 107
}
Final Model
Gradient Boosting Classifier on 1026 features
Grid search optimized parameters
Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Newer scikit-learn releases call this loss "log_loss".
clf = GradientBoostingClassifier(
    loss='deviance', learning_rate=0.1, n_estimators=300, subsample=0.9,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_depth=3, init=None, random_state=None, max_features=200,
    max_leaf_nodes=None, warm_start=False, verbose=2
)
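Final Model tSNE Plot
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline

# Reduce to 50 dimensions with truncated SVD, then embed in 2D with t-SNE.
pipe = Pipeline([
    ("tsvd", TruncatedSVD(n_components=50)),
    ("tsne", TSNE(n_components=2, perplexity=40.0,
                  early_exaggeration=4.0, learning_rate=1000.0,
                  n_iter=1000, metric='euclidean', init='random'))
])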
Results:
I did OK…
More focused on productization
Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi supervised learning
Slide labels: usable in a product, specific to competitions
Using Capstone
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Join the hex bytes of every line, dropping the address column and "?" placeholders.
code = bytes(bytearray.fromhex("".join(map(
    lambda l: "".join(l.split()[1:]).replace("?", ""),
    open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
))))
md = Cs(CS_ARCH_X86, CS_MODE_32)
# disasm_lite yields (address, size, mnemonic, op_str) tuples; keep the mnemonics.
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
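The mean and std lines can be recomputed from the rounded fold scores shown on the slide. A small sketch, assuming std is the population standard deviation (the NumPy default); because the inputs are rounded, the results agree with the slide only to about four decimal places:

```python
import statistics

# Fold scores as printed on the slide (already rounded).
cv_scores = {
    "ida": [0.03800, 0.02551, 0.05283, 0.03953, 0.0350],
    "capstone": [0.05065, 0.0451, 0.06953, 0.05583, 0.05089],
}

summary = {
    name: (statistics.fmean(scores), statistics.pstdev(scores))
    for name, scores in cv_scores.items()
}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.5f} std={std:.5f}")
```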
Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python scriptable reversing framework
vivisect: pure Python, largely undocumented disassembler and analysis project
Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware
Thank You

huseindihon
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
dizzycaye
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 

Recently uploaded (20)

VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptxArtificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
 
Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Universidad de Valladolid degree offer diploma Transcript
Universidad de Valladolid  degree offer diploma TranscriptUniversidad de Valladolid  degree offer diploma Transcript
Universidad de Valladolid degree offer diploma Transcript
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
 
Welcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what yourWelcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what your
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
 
The University of New England degree offer diploma Transcript
The University of New England  degree offer diploma TranscriptThe University of New England  degree offer diploma Transcript
The University of New England degree offer diploma Transcript
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
Universidad Camilo José Cela degree offer diploma Transcript
Universidad Camilo José Cela  degree offer diploma TranscriptUniversidad Camilo José Cela  degree offer diploma Transcript
Universidad Camilo José Cela degree offer diploma Transcript
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 

Examining Malware with Python

  • 2. Examining Malware with Python. Phil Roth, Data Scientist at Endgame (@mrphilroth)
  • 3. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware
  • 4. Yes, it’s malware, but what kind?
  • 5. The Data: 10868 labeled samples, 10873 unlabeled samples, ~500 GB uncompressed, 9 classes
  • 7. Hex Dump (raw data in hex):

        00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
        00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
        00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
        00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
        00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
        00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
        00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
        00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
        00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
        00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
        004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
        004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
        004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
        004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
        004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
        004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
        00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
        00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
        00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
        00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
        00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
        00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
        00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
        00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
        00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
        00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
  • 8. Hex Dump (raw data in hex): the same dump as slide 7, with the fragment 00401180 EC 01 2A 10 2A 01 AE called out.
  • 9. Disassembly:

        HEADER:00400000 ;
        HEADER:00400000 ; +-------------------------------------------------------------------------+
        HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
        HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
        HEADER:00400000 ; | License info: |
        HEADER:00400000 ; | Microsoft |
        HEADER:00400000 ; +-------------------------------------------------------------------------+
        HEADER:00400000 ;
        HEADER:00400000
        HEADER:00400000
        HEADER:00400000 .686p
        HEADER:00400000 .mmx
        HEADER:00400000 .model flat
        HEADER:00400000
        HEADER:00400000 ; ===========================================================================
        HEADER:00400000
        HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
        .text:00401000 ;
        .text:00401000 ; Format : Portable executable for 80386 (PE)
        .text:00401000 ; Imagebase : 400000
        .text:00401000 ; Section 1. (virtual address 00001000)
        .text:00401000 ; Virtual size : 00071050 ( 462928.)
        .text:00401000 ; Section size in file : 00071200 ( 463360.)
        .text:00401000 ; Offset to raw data for section: 00000400
        .text:00401000 ; Flags 60000020: Text Executable Readable
        .text:00401000 ; Alignment : default
        .text:00401000 ; ===========================================================================
  • 10, 11. Disassembly: the same listing as slide 9, with the line prefix HEADER:00400000 called out.
  • 12. Disassembly:

        .text:00470050 ; =============== S U B R O U T I N E ====================================
        .text:00470050
        .text:00470050 ; Attributes: bp-based frame
        .text:00470050
        .text:00470050 sub_470050 proc near ; CODE XREF: start+D8D↓p
        .text:00470050
        .text:00470050 var_68 = dword ptr -68h
        .text:00470050 var_64 = dword ptr -64h
        .text:00470050 var_60 = dword ptr -60h
        .text:00470050
        .text:00470050 55 push ebp
        .text:00470051 8B EC mov ebp, esp
        .text:00470053 83 C4 98 add esp, 0FFFFFF98h
        .text:00470056 33 C0 xor eax, eax
        .text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
        .text:0047005E 89 55 EC mov [ebp+var_14], edx
        .text:00470061 89 45 EC mov [ebp+var_14], eax
        .text:00470064 53 push ebx
        .text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
        .text:0047006B 83 FB 2D cmp ebx, 2Dh
        .text:0047006E 75 03 jnz short loc_470073
        .text:00470070 89 5D EC mov [ebp+var_14], ebx
        .text:00470073
        .text:00470073 loc_470073: ; CODE XREF: sub_470050+1E↑j
        .text:00470073 56 push esi
        .text:00470074 33 C0 xor eax, eax
        .text:00470076 8B 5D EC mov ebx, [ebp+var_14]
  • 13, 14. Disassembly: the same listing as slide 12, with the instruction mov ebx, dword_4B107C called out.
  • 15. Disassembly:

        .idata:0046F4DC ;
        .idata:0046F4DC ; Imports from KERNEL32.DLL
        .idata:0046F4DC ;
        .idata:0046F4DC ; ===========================================================================
        .idata:0046F4DC
        .idata:0046F4DC ; Segment type: Externs
        .idata:0046F4DC ; _idata
        .idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
        .idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
        .idata:0046F4DC ; DATA XREF: .text:0046F66C↓o
        .idata:0046F4DC ; GetCurrentThreadId↓r
        .idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
        .idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C↓r
        .idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
        .idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
        .idata:0046F4E4 ; DATA XREF: .text:00471E46↓r
        .idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
        .idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
        .idata:0046F4E8 ; DATA XREF: VirtualAlloc↓r
        .idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
        .idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
        .idata:0046F4EC ; DATA XREF: .text:00471E70↓r
  • 16. Disassembly: the same listing as slide 15, with "Imports from KERNEL32.DLL" and "__stdcall VirtualAlloc(" called out.
  • 18. Byte ngrams

        00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
        00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
        00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
        00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04

    Possibilities: 1gram: 256, 2gram: 65536, 3gram: 16777216, 4gram: 4294967296
    Solution: Hashing
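The counts on this slide are 256^n for sequences of n bytes, which is why a fixed-size hashed feature space is used instead of an explicit vocabulary. The following is a minimal, standard-library-only sketch of that idea. It uses zlib.crc32 as a stand-in hash, which is an assumption made for illustration (scikit-learn's HashingVectorizer uses its own hash function), and the helper name hashed_ngram_counts is hypothetical.

```python
import zlib
from collections import Counter

# The number of distinct byte n-grams grows as 256**n.
for n in range(1, 5):
    print(f"{n}gram: {256 ** n}")


def hashed_ngram_counts(tokens, max_n=3, n_buckets=2 ** 16):
    """Count all 1..max_n grams of `tokens`, folding each n-gram into one
    of `n_buckets` slots so the feature vector has a fixed size."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            # crc32 is an illustrative stand-in hash (assumption).
            bucket = zlib.crc32(ngram.encode("ascii")) % n_buckets
            counts[bucket] += 1
    return counts


# First row of the hex dump from slide 7, address column removed.
row = "00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20".split()
features = hashed_ngram_counts(row)
```

Different n-grams can land in the same bucket, so the bucket count trades memory against collisions; the slides use 2**16 buckets for byte ngrams.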
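  • 19. Byte ngrams. Code for extracting the byte ngrams and reducing dimensionality:

        from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
        from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
        from sklearn.pipeline import Pipeline

        # Note: non_negative was removed from HashingVectorizer in later
        # scikit-learn releases; alternate_sign=False is the closest option there.
        vectorizer = HashingVectorizer(
            input="content", lowercase=True, stop_words=None,
            ngram_range=(1,3), analyzer="word", n_features=2**16,
            binary=False, norm=None, non_negative=True
        )
        pipe = Pipeline([
            ("extraction", CustomExtractor(vectorizer=vectorizer)),
            ("sel", VarianceThreshold(threshold=0)),
            ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                       smooth_idf=True, sublinear_tf=True)),
            ("kbest", SelectKBest(score_func=f_classif, k=500))
        ])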
  • 20. Byte ngrams. The CustomExtractor step used by the pipeline on slide 19:

        import multiprocessing

        import scipy.sparse
        import toolz
        from sklearn.feature_extraction.text import HashingVectorizer
        from toolz.curried import map, filter  # curried versions, needed by toolz.pipe

        class CustomExtractor():

            def __init__(self, vectorizer=HashingVectorizer()):
                self.vectorizer = vectorizer

            def fit(self, X, y):
                return self  # stateless

            def transform(self, X, y=None):
                # Vectorize files in parallel, 32 file names per task
                pool = multiprocessing.Pool()
                rows = pool.map(self.feature_extract, X, 32)
                return scipy.sparse.vstack(list(rows))

            fit_transform = transform

            def feature_extract(self, file_name):
                # Drop the address column and the "??" / "?" placeholders
                clean_bytes = " ".join(toolz.pipe(
                    open(file_name, "r"),
                    map(lambda line: line.rstrip().split()[1:]),
                    toolz.concat,
                    filter(lambda b: b != "??" and b != "?")
                ))
                return self.vectorizer.transform([clean_bytes])
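To make the feature_extract step concrete, here is a small sketch of the same cleaning logic applied to in-memory lines rather than a file. It uses only the standard library. The function name clean_hex_lines is hypothetical, and the second sample line is invented to show how placeholder tokens are handled.

```python
def clean_hex_lines(lines):
    """Drop the leading address column and any '??' / '?' placeholder
    tokens, returning one space-separated string of byte tokens."""
    tokens = []
    for line in lines:
        fields = line.rstrip().split()
        for b in fields[1:]:          # fields[0] is the address
            if b not in ("??", "?"):  # skip placeholder tokens
                tokens.append(b)
    return " ".join(tokens)


sample = [
    "00401000 00 00 80 40 40 28 00 1C\n",
    "00401008 ?? ?? 02 42\n",  # hypothetical line with placeholders
]
clean = clean_hex_lines(sample)
print(clean)
```

The resulting string is what gets passed to the vectorizer, which then treats each two-character byte as a "word".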
  • 21. Byte ngrams. Why they might be useful: https://github.com/wapiflapi/binglide
  • 23. Instruction ngrams. Extracted instructions: push lea push mov call mov mov pop retn mov jmp push mov mov call test jz push call add mov pop retn mov mov mov mov retn mov lea mov inc test jnz sub retn mov mov mov push mov push push push push call add mov pop retn mov mov mov push mov push push push push call add mov pop retn xor retn mov retn mov retn mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call push call mov push push push mov call mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call mov push push push mov call push call mov retn
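A rough sketch of how a mnemonic sequence like the one on this slide could be pulled out of IDA-style listing lines and turned into unigram and bigram counts. The regular expression, the assumed line layout (address, hex byte pairs, then a lowercase mnemonic), and the helper names are all assumptions made for this example; the talk does not show its own extraction code.

```python
import re
from collections import Counter

# Assumed layout: "section:address", one or more hex byte pairs, mnemonic.
INSTRUCTION_RE = re.compile(r"^\S+:[0-9A-F]{8}\s+(?:[0-9A-F]{2}\s+)+([a-z]+)")


def extract_mnemonics(lines):
    """Return the mnemonic of every line that looks like an instruction."""
    matches = (INSTRUCTION_RE.match(line) for line in lines)
    return [m.group(1) for m in matches if m is not None]


def ngram_counts(tokens, max_n=2):
    """Count all 1..max_n grams of a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


listing = [
    ".text:00470050 ; Attributes: bp-based frame",
    ".text:00470050 var_68 = dword ptr -68h",
    ".text:00470050 55 push ebp",
    ".text:00470051 8B EC mov ebp, esp",
    ".text:00470053 83 C4 98 add esp, 0FFFFFF98h",
    ".text:00470056 33 C0 xor eax, eax",
    ".text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C",
]
mnemonics = extract_mnemonics(listing)
counts = ngram_counts(mnemonics)
print(mnemonics)
```

Comment lines and variable declarations have no hex byte column, so the pattern skips them. A real listing would likely need more care, for example with data directives.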
  • 24. Instruction ngrams. Code for extracting the instruction ngrams and reducing dimensionality:

        vectorizer = HashingVectorizer(
            input="content", lowercase=True, stop_words=None,
            ngram_range=(1, 2), analyzer="word", n_features=2**25,
            binary=False, norm=None, non_negative=True
        )
        pipe = Pipeline([
            ("extraction", CustomExtractor(vectorizer=vectorizer)),
            ("sel", VarianceThreshold(threshold=0)),
            ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                       smooth_idf=True, sublinear_tf=True)),
            ("kbest", SelectKBest(score_func=f_classif, k=500))
        ])
  • 25. Named Features: section names, imports, and imported functions. These features were extracted with regular expressions and were (awkwardly) selected in the same step as the instruction ngrams.
  • 26. Named Features:

        import re

        # Each entry: a pattern to search for, how to pull the feature name
        # out of the match, and a filter applied to the extracted name.
        re_features = {
            "imports" : {
                "re" : re.compile(r"Imports from \w.+"),
                "extract" : lambda m : m.group().split()[-1],
                "filter" : lambda m : True
            },
            "imported_functions" : {
                "re" : re.compile(r"__stdcall \w.+\("),
                "extract" : lambda m : m.group().split()[-1][:-1],
                "filter" : lambda m : not m.startswith("sub_")
            },
            "section_names" : {
                "re" : re.compile(r"^\S+?:"),
                "extract" : lambda m : m.group()[:-1],
                "filter" : lambda m : True
            }
        }
  • 27. Named Features:

        from toolz import pipe, unique
        from toolz.curried import map, filter

        def process_re_feature(lines, re_dict) :
            # Search each line, keep the matches, extract and filter the
            # feature names, and drop duplicates.
            return pipe(
                lines,
                map(re_dict["re"].search),
                filter(lambda m : m is not None),
                map(re_dict["extract"]),
                filter(re_dict["filter"]),
                unique
            )
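As an illustration of what these patterns pick up, the sketch below applies them to a few lines taken from the listing on slide 15. It is a standard-library-only variant (a plain loop instead of toolz), the patterns are written with the regex escapes \w, \S and \(, and the per-feature filter step is omitted for brevity. The names NAMED_FEATURES and extract_named_features are hypothetical.

```python
import re

# (pattern, extract) pairs mirroring the three named feature types.
NAMED_FEATURES = {
    "imports": (re.compile(r"Imports from \w.+"),
                lambda m: m.group().split()[-1]),
    "imported_functions": (re.compile(r"__stdcall \w.+\("),
                           lambda m: m.group().split()[-1][:-1]),
    "section_names": (re.compile(r"^\S+?:"),
                      lambda m: m.group()[:-1]),
}


def extract_named_features(lines):
    """Return, per feature type, the unique values in first-seen order."""
    found = {name: [] for name in NAMED_FEATURES}
    for line in lines:
        for name, (pattern, extract) in NAMED_FEATURES.items():
            m = pattern.search(line)
            if m is not None:
                value = extract(m)
                if value not in found[name]:
                    found[name].append(value)
    return found


lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]
features = extract_named_features(lines)
print(features)
```

Each distinct DLL, imported function, or section name then becomes a candidate feature for a sample.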
  • 29. Manual Features (sample 0A32eTdBKayjCWhZqDOQ):

        {
            "number_of_collapsed_functions": 451,
            "number_of_imported_functions": 101,
            "sample_length": 1201668,
            "number_of_imports": 4,
            "number_of_sections": 4,
            "section_length_0": 979764,
            ...
            "section_length_6": 0,
            "length_of_functions_0": 2706,
            ...
            "length_of_functions_15": 107
        }
  • 30. Final Model: Gradient Boosting Classifier on 1026 features, with grid search optimized parameters. Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier.

        from sklearn.ensemble import GradientBoostingClassifier

        # Note: newer scikit-learn releases call the 'deviance' loss 'log_loss'.
        clf = GradientBoostingClassifier(
            loss='deviance', learning_rate=0.1, n_estimators=300,
            subsample=0.9, min_samples_split=2, min_samples_leaf=1,
            min_weight_fraction_leaf=0.0, max_depth=3, init=None,
            random_state=None, max_features=200, max_leaf_nodes=None,
            warm_start=False, verbose=2
        )
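  • 31. Final Model tSNE Plot
  • 32. Final Model tSNE Plot:

        from sklearn.decomposition import TruncatedSVD
        from sklearn.manifold import TSNE
        from sklearn.pipeline import Pipeline

        # Reduce to 50 dimensions first, then embed in 2D for plotting
        pipe = Pipeline([
            ("tsvd", TruncatedSVD(n_components=50)),
            ("tsne", TSNE(n_components=2, perplexity=40.0,
                          early_exaggeration=4.0, learning_rate=1000.0,
                          n_iter=1000, metric='euclidean', init='random'))
        ])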
  • 33. Results: I did OK… More focused on productization
  • 34. Winning Strategies: xgboost; malware as an image; compression ratio as a feature; other expanded feature sets; probability calibration; semi supervised learning. Legend: usable in a product / specific to competitions.
  • 35. Using Capstone

        ida ******************************
        CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
        mean: 0.03817940685733493 std: 0.008799619405211161

        capstone ******************************
        CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
        mean: 0.05441113231562615 std: 0.008283830117670508

        # Join the hex columns of every line (address dropped, "?" removed)
        # and convert the result to raw bytes.
        code = bytes(bytearray.fromhex("".join(map(
            lambda l : "".join(l.split()[1:]).replace("?", ""),
            open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
        ))))

        from capstone import Cs, CS_ARCH_X86, CS_MODE_32
        md = Cs(CS_ARCH_X86, CS_MODE_32)
        # disasm_lite yields (address, size, mnemonic, op_str); t[2] is the mnemonic
        instructions = " ".join(
            [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
        )
  • 36. Disassemblers: IDA: not (easily) batch distributable. capstone: single pass produces suboptimal results. radare2: Python scriptable reversing framework. vivisect: pure Python, largely undocumented disassembler and analysis project.
  • 37. Other Projects: pefile: extracts header information from executables. binglide: visualizations of entropy and byte ngrams. cuckoo: automated dynamic analysis. barf: binary analysis framework with code analysis.
  • 38. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware