
Examining Malware with Python

Talk given at SciPy 2015.

Published in: Data & Analytics

  1. Examining Malware with Python. Phil Roth, Data Scientist at Endgame, @mrphilroth
  2. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware
  3. Yes, it’s malware, but what kind?
  4. The Data: 10868 labeled samples, 10873 unlabeled samples, ~500 GB uncompressed, 9 classes
  5. Classes
  6. Hex Dump (raw data in hex)

       00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
       00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
       00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
       00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
       00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
       00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
       00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
       00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
       00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
       00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
       004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
       004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
       004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
       004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
       004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
       004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
       00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
       00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
       00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
       00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
       00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
       00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
       00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
       00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
       00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
       00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
  7. Hex Dump (raw data in hex): the same dump as slide 6, with this fragment of one row highlighted: 00401180 EC 01 2A 10 2A 01 AE
  8. Disassembly

       HEADER:00400000 ;
       HEADER:00400000 ; +-------------------------------------------------------------------------+
       HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
       HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
       HEADER:00400000 ; | License info: |
       HEADER:00400000 ; | Microsoft |
       HEADER:00400000 ; +-------------------------------------------------------------------------+
       HEADER:00400000 ;
       HEADER:00400000
       HEADER:00400000
       HEADER:00400000 .686p
       HEADER:00400000 .mmx
       HEADER:00400000 .model flat
       HEADER:00400000
       HEADER:00400000 ; ===========================================================================
       HEADER:00400000
       HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
       .text:00401000 ;
       .text:00401000 ; Format : Portable executable for 80386 (PE)
       .text:00401000 ; Imagebase : 400000
       .text:00401000 ; Section 1. (virtual address 00001000)
       .text:00401000 ; Virtual size : 00071050 ( 462928.)
       .text:00401000 ; Section size in file : 00071200 ( 463360.)
       .text:00401000 ; Offset to raw data for section: 00000400
       .text:00401000 ; Flags 60000020: Text Executable Readable
       .text:00401000 ; Alignment : default
       .text:00401000 ; ===========================================================================
  9. Disassembly: the same header listing as slide 8, with the segment-and-address prefix highlighted: HEADER:00400000
  10. Disassembly: the same listing and highlight as slide 9.
  11. Disassembly

       .text:00470050 ; =============== S U B R O U T I N E ====================================
       .text:00470050
       .text:00470050 ; Attributes: bp-based frame
       .text:00470050
       .text:00470050 sub_470050 proc near ; CODE XREF: start+D8D↓p
       .text:00470050
       .text:00470050 var_68 = dword ptr -68h
       .text:00470050 var_64 = dword ptr -64h
       .text:00470050 var_60 = dword ptr -60h
       .text:00470050
       .text:00470050 55 push ebp
       .text:00470051 8B EC mov ebp, esp
       .text:00470053 83 C4 98 add esp, 0FFFFFF98h
       .text:00470056 33 C0 xor eax, eax
       .text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
       .text:0047005E 89 55 EC mov [ebp+var_14], edx
       .text:00470061 89 45 EC mov [ebp+var_14], eax
       .text:00470064 53 push ebx
       .text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
       .text:0047006B 83 FB 2D cmp ebx, 2Dh
       .text:0047006E 75 03 jnz short loc_470073
       .text:00470070 89 5D EC mov [ebp+var_14], ebx
       .text:00470073
       .text:00470073 loc_470073: ; CODE XREF: sub_470050+1E↑j
       .text:00470073 56 push esi
       .text:00470074 33 C0 xor eax, eax
       .text:00470076 8B 5D EC mov ebx, [ebp+var_14]
  12. Disassembly: the same subroutine listing as slide 11, with one instruction highlighted: mov ebx,dword_4B107C
  13. Disassembly: the same listing and highlight as slide 12.
  14. Disassembly

       .idata:0046F4DC ;
       .idata:0046F4DC ; Imports from KERNEL32.DLL
       .idata:0046F4DC ;
       .idata:0046F4DC ; ===========================================================================
       .idata:0046F4DC
       .idata:0046F4DC ; Segment type: Externs
       .idata:0046F4DC ; _idata
       .idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
       .idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
       .idata:0046F4DC ; DATA XREF: .text:0046F66C↓o
       .idata:0046F4DC ; GetCurrentThreadId↓r
       .idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
       .idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C↓r
       .idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
       .idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
       .idata:0046F4E4 ; DATA XREF: .text:00471E46↓r
       .idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
       .idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
       .idata:0046F4E8 ; DATA XREF: VirtualAlloc↓r
       .idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
       .idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
       .idata:0046F4EC ; DATA XREF: .text:00471E70↓r
  15. Disassembly: the same import listing as slide 14, with two fragments highlighted: "Imports from KERNEL32.DLL" and "__stdcall VirtualAlloc("
  16. My Solution. Features: byte ngrams, instruction ngrams, named features, manual features. Feature Selection: SelectKBest (two instances). Model: Gradient Boosting Classifier.
  17. Byte ngrams

       00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
       00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
       00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
       00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04

     Possibilities: 1gram: 256, 2gram: 65536, 3gram: 16777216, 4gram: 4294967296. Solution: Hashing
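The counts on this slide grow as 256**n, which is why a fixed-width hashed feature vector is attractive. The following is a minimal sketch of the hashing trick using only the standard library. The names `ngram_space` and `hash_byte_ngrams` are hypothetical, and `zlib.crc32` is an assumption chosen because it ships with Python; scikit-learn's `HashingVectorizer` uses a different hash function. The bucket count of 2**16 mirrors the `n_features` value used for byte ngrams in the talk.

```python
import zlib
from collections import Counter


def ngram_space(n):
    """Number of distinct byte n-grams of length n."""
    return 256 ** n


def hash_byte_ngrams(data, n_max=3, n_buckets=2 ** 16):
    """Count byte n-grams (n = 1..n_max) in a fixed number of hash buckets.

    Illustrative stand-in for a hashing vectorizer: crc32 is an assumption,
    not the hash that scikit-learn uses.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            bucket = zlib.crc32(data[i:i + n]) % n_buckets
            counts[bucket] += 1
    return counts


# First row of the hex dump shown on the slide.
row = bytes.fromhex("00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20")
features = hash_byte_ngrams(row)
```

Whatever the input size, the feature vector never has more than `n_buckets` columns; the price is that unrelated ngrams can collide in the same bucket.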
  18. Byte ngrams. Code for extracting the byte ngrams and reducing dimensionality:

       from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
       from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
       from sklearn.pipeline import Pipeline

       vectorizer = HashingVectorizer(
           input="content", lowercase=True, stop_words=None,
           ngram_range=(1,3), analyzer="word", n_features=2**16,
           binary=False, norm=None,
           non_negative=True  # as in the talk; newer scikit-learn releases use alternate_sign=False
       )

       pipe = Pipeline([
           ("extraction", CustomExtractor(vectorizer=vectorizer)),
           ("sel", VarianceThreshold(threshold=0)),
           ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                      smooth_idf=True, sublinear_tf=True)),
           ("kbest", SelectKBest(score_func=f_classif, k=500))
       ])

  19. Byte ngrams. The same vectorizer and pipeline as slide 18, plus the custom extraction step:

       import multiprocessing
       import scipy.sparse
       import toolz
       from toolz.curried import map, filter

       class CustomExtractor():
           def __init__(self, vectorizer=HashingVectorizer()):
               self.vectorizer = vectorizer

           def fit(self, X, y):
               return self  # stateless

           def transform(self, X, y=None):
               # Hash each file in a worker process, 32 files per chunk.
               pool = multiprocessing.Pool()
               rows = pool.map(self.feature_extract, X, 32)
               return scipy.sparse.vstack(list(rows))

           fit_transform = transform

           def feature_extract(self, file_name):
               # Drop the address column, flatten the byte tokens, skip unknown bytes.
               clean_bytes = " ".join(toolz.pipe(
                   open(file_name, "r"),
                   map(lambda line: line.rstrip().split()[1:]),
                   toolz.concat,
                   filter(lambda b: b != "??" and b != "?")
               ))
               return self.vectorizer.transform([clean_bytes])
  20. Byte ngrams. Why they might be useful: https://github.com/wapiflapi/binglide
  21. Byte ngrams: sample 0A32eTdBKayjCWhZqDOQ
  22. Instruction ngrams. Extracted instructions: push lea push mov call mov mov pop retn mov jmp push mov mov call test jz push call add mov pop retn mov mov mov mov retn mov lea mov inc test jnz sub retn mov mov mov push mov push push push push call add mov pop retn mov mov mov push mov push push push push call add mov pop retn xor retn mov retn mov retn mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call push call mov push push push mov call mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call mov push push push mov call push call mov retn
  23. Instruction ngrams. Code for extracting the instruction ngrams and reducing dimensionality:

       vectorizer = HashingVectorizer(
           input="content", lowercase=True, stop_words=None,
           ngram_range=(1, 2), analyzer="word", n_features=2**25,
           binary=False, norm=None, non_negative=True
       )

       pipe = Pipeline([
           ("extraction", CustomExtractor(vectorizer=vectorizer)),
           ("sel", VarianceThreshold(threshold=0)),
           ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                      smooth_idf=True, sublinear_tf=True)),
           ("kbest", SelectKBest(score_func=f_classif, k=500))
       ])
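With `ngram_range=(1, 2)`, every sample contributes single mnemonics and adjacent pairs of mnemonics. The sketch below shows what those tokens look like before they are hashed. It is illustrative only: `mnemonic_ngrams` is a hypothetical helper, and the input is just the first nine mnemonics from the extracted stream on slide 22.

```python
from collections import Counter


def mnemonic_ngrams(instructions, n_min=1, n_max=2):
    """Count word ngrams of length n_min..n_max in a mnemonic stream."""
    tokens = instructions.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


sample = "push lea push mov call mov mov pop retn"
counts = mnemonic_ngrams(sample)
```

Here `counts["mov"]` is 3 and `counts["mov mov"]` is 1; the bigrams keep some local ordering that the unigram counts alone discard.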
  24. Named Features: Section Names, Imports, Imported Functions. Extracted these features with regular expressions. Features were (awkwardly) selected in the same step as instruction ngrams.
  25. Named Features

       import re

       # Each entry: a pattern to search for, how to pull the name out of the
       # match, and a filter applied to the extracted name.
       re_features = {
           "imports": {
               "re": re.compile(r"Imports from \w.+"),
               "extract": lambda m: m.group().split()[-1],
               "filter": lambda m: True
           },
           "imported_functions": {
               "re": re.compile(r"__stdcall \w.+\("),
               "extract": lambda m: m.group().split()[-1][:-1],
               "filter": lambda m: not m.startswith("sub_")
           },
           "section_names": {
               "re": re.compile(r"^\S+?:"),
               "extract": lambda m: m.group()[:-1],
               "filter": lambda m: True
           }
       }
  26. Named Features

       from toolz import pipe, unique
       from toolz.curried import map, filter

       def process_re_feature(lines, re_dict):
           # Search every line, keep the matches, extract and filter the
           # names, and return each distinct name once.
           return pipe(
               lines,
               map(re_dict["re"].search),
               filter(lambda m: m is not None),
               map(re_dict["extract"]),
               filter(re_dict["filter"]),
               unique
           )
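To see this extraction in action, here is a self-contained sketch that applies the three patterns to a few lines taken from the disassembly slides. It assumes the patterns use `\w`, `\S`, and an escaped parenthesis, and it swaps the toolz pipeline for a plain loop so that it runs with the standard library alone. `PATTERNS` and `extract_named_features` are hypothetical names, not part of the talk's code.

```python
import re

# Pattern and extractor per feature type (assumed forms of the talk's regexes).
PATTERNS = {
    "imports": (re.compile(r"Imports from \w.+"),
                lambda m: m.group().split()[-1]),
    "imported_functions": (re.compile(r"__stdcall \w.+\("),
                           lambda m: m.group().split()[-1][:-1]),
    "section_names": (re.compile(r"^\S+?:"),
                      lambda m: m.group()[:-1]),
}


def extract_named_features(lines):
    """Return {feature_type: sorted unique values} for the given lines."""
    found = {name: set() for name in PATTERNS}
    for line in lines:
        for name, (pattern, extract) in PATTERNS.items():
            match = pattern.search(line)
            if match is None:
                continue
            value = extract(match)
            # Mirror the talk's filter: skip IDA's auto-named subroutines.
            if name == "imported_functions" and value.startswith("sub_"):
                continue
            found[name].add(value)
    return {name: sorted(values) for name, values in found.items()}


lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]
named = extract_named_features(lines)
```

Each distinct name then becomes a token for the sample, much like a word in a text document.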
  27. Named Features
  28. Manual Features (sample 0A32eTdBKayjCWhZqDOQ)

       {
         "number_of_collapsed_functions": 451,
         "number_of_imported_functions": 101,
         "sample_length": 1201668,
         "number_of_imports": 4,
         "number_of_sections": 4,
         "section_length_0": 979764,
         ...
         "section_length_6": 0,
         "length_of_functions_0": 2706,
         ...
         "length_of_functions_15": 107
       }
  29. Final Model. Gradient Boosting Classifier on 1026 features. Grid search optimized parameters. Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier.

       from sklearn.ensemble import GradientBoostingClassifier

       clf = GradientBoostingClassifier(
           loss='deviance',  # as in the talk; named 'log_loss' in scikit-learn >= 1.1
           learning_rate=0.1, n_estimators=300,
           subsample=0.9, min_samples_split=2, min_samples_leaf=1,
           min_weight_fraction_leaf=0.0, max_depth=3, init=None,
           random_state=None, max_features=200, max_leaf_nodes=None,
           warm_start=False, verbose=2
       )
  30. Final Model tSNE Plot
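  31. Final Model tSNE Plot

       from sklearn.decomposition import TruncatedSVD
       from sklearn.manifold import TSNE
       from sklearn.pipeline import Pipeline

       # Reduce to 50 dimensions first, then embed in 2D for plotting.
       pipe = Pipeline([
           ("tsvd", TruncatedSVD(n_components=50)),
           ("tsne", TSNE(n_components=2, perplexity=40.0,
                         early_exaggeration=4.0, learning_rate=1000.0,
                         n_iter=1000, metric='euclidean', init='random'))
       ])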
  32. Results: I did OK… More focused on productization
  33. Winning Strategies: xgboost; malware as an image; compression ratio as a feature; other expanded feature sets; probability calibration; semi supervised learning. Legend: usable in a product, specific to competitions.
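Of the strategies listed on this slide, the compression ratio is the simplest to illustrate. This is a rough sketch that assumes the feature is defined as compressed size divided by original size, using zlib. Neither the exact definition nor the compressor used by the winning entries is stated in the talk, and `compression_ratio` is a hypothetical helper.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size over original size; 0.0 for empty input (assumed convention)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


padding = bytes(4096)       # highly repetitive: compresses very well
noise = os.urandom(4096)    # random bytes: essentially incompressible
```

A ratio near zero points to padding or repetitive content, while a ratio near one often goes with packed or encrypted sections, which is presumably why the feature carries signal for classification.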
  34. Using Capstone

       ida ******************************
       CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
       mean: 0.03817940685733493 std: 0.008799619405211161

       capstone ******************************
       CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
       mean: 0.05441113231562615 std: 0.008283830117670508

       from capstone import Cs, CS_ARCH_X86, CS_MODE_32

       # Join the hex columns of the .bytes file: drop the address column
       # and remove "?" placeholders.
       code = bytes(bytearray.fromhex("".join(map(
           lambda l: "".join(l.split()[1:]).replace("?", ""),
           open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
       ))))

       md = Cs(CS_ARCH_X86, CS_MODE_32)
       # disasm_lite yields (address, size, mnemonic, op_str) tuples;
       # keep every mnemonic except int3.
       instructions = " ".join(
           [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
       )
  35. Disassemblers. IDA: not (easily) batch distributable. capstone: single pass produces suboptimal results. radare2: Python scriptable reversing framework. vivisect: pure Python, largely undocumented disassembler and analysis project.
  36. Other Projects. pefile: extracts header information from executables. binglide: visualizations of entropy and byte ngrams. cuckoo: automated dynamic analysis. barf: binary analysis framework with code analysis.
  37. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. References: http://bit.ly/scipy-malware
  38. Thank You
