Machine
Learning
Protect against
tomorrow’s
threats
HackNTU
TrendMicro Datasets
Trend Micro R&D, 林瑞豪
July, 2017
Datasets
Malware Dataset (malicious program data)
- 18,000 samples
Spam Dataset (spam email data)
- 200,000 samples
Network IPS Dataset (network intrusion prevention system event logs)
- 1,000,000 samples
2017/7/16 2
MALWARE DATASET
Malware Dataset
Sample volume: 18,000 PE malware
Sample size: 48 MB
Collected between Aug. 2015 and Jan. 2016
Data categories:
- PE header info: JSON format
- Section table: JSON format
- Import table: TSV format
File Information
Each folder contains the information for one PE file, split into three files:
info: File Header & Resource Information
sections: Section Table
import: Import Table
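
The folder layout above could be read with a small helper along these lines. This is a minimal sketch, assuming the three files are literally named `info`, `sections`, and `import`, and that `import` holds one DLL name and one function per line; the helper name `load_sample` is hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_sample(folder):
    """Load the three data files of one PE sample folder (hypothetical helper)."""
    folder = Path(folder)
    info = json.loads((folder / "info").read_text(encoding="utf-8"))
    sections = json.loads((folder / "sections").read_text(encoding="utf-8"))

    imports = []
    for line in (folder / "import").read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumption: tab-separated (TSV); fall back to any whitespace otherwise.
        parts = line.split("\t") if "\t" in line else line.split()
        if len(parts) >= 2:
            imports.append((parts[0], parts[1]))
    return info, sections, imports
```

Calling `load_sample(path_to_folder)` returns the header dict, the list of section dicts, and a list of `(dll, function)` pairs.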
Data File Example: info
$ cat info
{
"DllCharacteristics": "0x8000",
"TimeDateStamp": 538976288,
"BaseOfCode": "0x1000",
"FileEntropy": 5.3841451825025688,
"ImageVersion": "1.0",
"LoaderFlags": "0x0",
"SizeOfStackCommit": 4096,
"SizeOfUninitializedData": 4608,
"SizeOfHeapReserve": 1048576,
"LinkerVersion": "2.25",
"SizeOfHeapCommit": 4096,
"SizeOfStackReserve": 2097152,
"OperatingSystemVersion": "4.0",
"SizeOfHeaders": 1024,
"Subsystem": "0x3",
"NumberOfSections": 8,
"FileAlignment": "0x200",
"SubsystemVersion": "4.0",
"BaseOfData": "0x3000",
"SizeOfOptionalHeader": 224,
"AddressOfEntryPoint": "0x1000",
"SectionAlignment": "0x1000",
"SizeOfCode": 7168,
"ImageBase": "0x400000",
"SizeOfInitializedData": 14848,
"NumberOfSymbols": 0,
"SizeOfImage": 45056,
"NumberOfRvaAndSizes": 16,
"FileSize": 15886,
"Characteristics": "0x32f"
}
PE header info in JSON format
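
Most learning libraries expect numeric input, while the `info` record mixes integers, floats, hexadecimal strings, and version strings. The sketch below shows one possible conversion; the function name `info_to_features` and the `_major`/`_minor` suffixes are assumptions made for illustration, not part of the dataset.

```python
import json


def info_to_features(info):
    """Turn one `info` record into a flat dict of floats (illustrative only)."""
    features = {}
    for key, value in info.items():
        if isinstance(value, bool):
            continue
        if isinstance(value, (int, float)):
            features[key] = float(value)
        elif isinstance(value, str) and value.lower().startswith("0x"):
            # Hexadecimal strings such as "0x400000"
            features[key] = float(int(value, 16))
        elif isinstance(value, str):
            # Version strings such as "2.25": keep major and minor parts apart
            major, _, minor = value.partition(".")
            if major.isdigit() and minor.isdigit():
                features[key + "_major"] = float(major)
                features[key + "_minor"] = float(minor)
    return features


sample = json.loads(
    '{"FileSize": 15886, "ImageBase": "0x400000", '
    '"LinkerVersion": "2.25", "FileEntropy": 5.3841451825025688}'
)
features = info_to_features(sample)
```

Splitting the version string avoids treating "2.25" as the decimal number 2.25, which would order minor version 25 below minor version 3.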
Fields of info
Field                    Description
FileSize                 File size
FileEntropy              Entropy of the whole file
AddressOfEntryPoint      Entry point address
BaseOfCode               Beginning of the code section
BaseOfData               Beginning of the data section
ImageBase                Preferred load address in memory
TimeDateStamp            Low 32 bits of the time stamp of the image
NumberOfSections         Number of sections
NumberOfSymbols          Number of symbols in the symbol table
NumberOfRvaAndSizes      Number of directory entries
Characteristics          Characteristics of the image
DllCharacteristics       DLL characteristics
SizeOfOptionalHeader     Size of the optional header
SizeOfCode               Size of the code sections
SizeOfInitializedData    Size of the initialized data sections
SizeOfUninitializedData  Size of the uninitialized data sections
SizeOfImage              Size of the image
SizeOfHeaders            Size of the headers
SizeOfStackReserve       Reserved size for the stack
SizeOfStackCommit        Committed size for the stack
SizeOfHeapReserve        Reserved size for the heap
SizeOfHeapCommit         Committed size for the heap
FileAlignment            Section alignment in the file
SectionAlignment         Section alignment in memory
LoaderFlags              Loader flags (obsolete)
Subsystem                Subsystem required to run this image
SubsystemVersion         Version of the subsystem
LinkerVersion            Version of the linker
ImageVersion             Version of the image
OperatingSystemVersion   Version of the OS
https://msdn.microsoft.com/en-us/library/windows/desktop/ms680339%28v=vs.85%29.aspx
CompanyName
ProductName
LegalCopyright
FileDescription
FileVersion
ProductVersion
Data File Example: sections
$ cat sections
[
{
"Index": 0,
"Name": ".text\u0000\u0000\u0000",
"Entropy": 5.8200022539922749,
"VirtualSize": 6676,
"Flags": "R-X CODE",
"RawSize": 7168,
"VirtualAddress": "0x1000"
},
{
"Index": 1,
"Name": ".data\u0000\u0000\u0000",
"Entropy": 0.057256602241154482,
"VirtualSize": 68,
"Flags": "RW- IDATA",
"RawSize": 512,
"VirtualAddress": "0x3000"
},
{
"Index": 2,
"Name": ".rdata\u0000\u0000",
"Entropy": 5.043049159297726,
"VirtualSize": 1824,
"Flags": "R-- IDATA",
"RawSize": 2048,
"VirtualAddress": "0x4000"
},
…skip…
{
"Index": 7,
"Name": ".rsrc\u0000\u0000\u0000",
"Entropy": 4.7784771683762584,
"VirtualSize": 1256,
"Flags": "RW- IDATA",
"RawSize": 1536,
"VirtualAddress": "0xa000"
}
]
Section table in JSON format
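
Section names in a PE file occupy eight bytes, so shorter names arrive padded with NUL characters. The sketch below strips that padding and derives a few per-file values; the function name `summarize_sections` and the choice of summary values are assumptions for illustration.

```python
def summarize_sections(sections):
    """Summarize a parsed `sections` list (illustrative helper)."""
    # Drop the NUL padding that fills names up to eight bytes
    names = [s["Name"].rstrip("\x00") for s in sections]
    # The first token of Flags holds the R/W/X permissions, e.g. "R-X"
    executable = [
        name
        for name, s in zip(names, sections)
        if "X" in s.get("Flags", "").split(" ")[0]
    ]
    return {
        "names": names,
        "max_entropy": max((s["Entropy"] for s in sections), default=0.0),
        "executable": executable,
    }


example = [
    {"Name": ".text\x00\x00\x00", "Entropy": 5.8200022539922749, "Flags": "R-X CODE"},
    {"Name": ".data\x00\x00\x00", "Entropy": 0.057256602241154482, "Flags": "RW- IDATA"},
]
summary = summarize_sections(example)
```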
Fields in sections
Field           Description
Name            Section name
VirtualAddress  Section virtual address
VirtualSize     Section virtual size
RawSize         Section raw size
Entropy         Section entropy
Flags           Section permission flags (read/write/execute) and content type
https://msdn.microsoft.com/en-us/library/windows/desktop/ms680341%28v=vs.85%29.aspx
https://msdn.microsoft.com/en-us/library/ms809762.aspx?ppud=4
Data File Example: import
$ cat import
cygwin1.dll __cxa_atexit
cygwin1.dll __getreent
cygwin1.dll __main
cygwin1.dll _dll_crt0@0
cygwin1.dll _fopen64
cygwin1.dll _impure_ptr
cygwin1.dll atoi
cygwin1.dll calloc
cygwin1.dll cygwin_detach_dll
cygwin1.dll cygwin_internal
cygwin1.dll dll_dllcrt0
cygwin1.dll exit
cygwin1.dll fclose
cygwin1.dll fflush
cygwin1.dll fopen
cygwin1.dll fprintf
cygwin1.dll free
cygwin1.dll fwrite
cygwin1.dll getc
cygwin1.dll malloc
cygwin1.dll posix_memalign
cygwin1.dll printf
cygwin1.dll putc
cygwin1.dll puts
cygwin1.dll realloc
cygwin1.dll vfprintf
KERNEL32.dll GetModuleHandleA
KERNEL32.dll GetProcAddress
DLL name, function
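
For feature extraction it is often convenient to group imported functions by DLL. A minimal sketch, assuming one DLL name and one function per line; DLL names are lowercased because Windows treats them case-insensitively. The function name `group_imports` is hypothetical.

```python
from collections import defaultdict


def group_imports(lines):
    """Group imported function names by DLL (illustrative helper)."""
    by_dll = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        dll, func = parts
        by_dll[dll.lower()].append(func)
    return dict(by_dll)


example = [
    "cygwin1.dll malloc",
    "cygwin1.dll free",
    "KERNEL32.dll GetProcAddress",
]
grouped = group_imports(example)
```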
Malware Dataset Example
Malware Dataset Application
One-class malware identification
Unsupervised malware classification
https://tbrain.nchc.org.tw/index.php?r=script%2Fview&title_id=6
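
Because the dataset contains only malware, a one-class approach learns a profile from known samples and scores how far a new sample lies from it. The sketch below is one possible baseline using per-feature standardized distance, not the method used by the organizers; the feature choice (file entropy, number of sections) and the toy values are invented for illustration. A library model such as scikit-learn's `OneClassSVM` could take the place of this hand-written scoring.

```python
import math


def fit_profile(rows):
    """Compute per-feature mean and standard deviation from known samples."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dims)]
    stds = []
    for i in range(dims):
        var = sum((r[i] - means[i]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # avoid division by zero
    return means, stds


def distance(row, profile):
    """Standardized Euclidean distance of one sample from the profile."""
    means, stds = profile
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds)))


# Toy feature rows: [FileEntropy, NumberOfSections] (made-up values)
known = [[5.0, 8], [5.5, 7], [6.0, 9]]
profile = fit_profile(known)
near = distance([5.5, 8], profile)
far = distance([1.0, 30], profile)
```

A threshold on the distance, chosen on held-out data, would separate samples that resemble the training set from those that do not.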
SPAM DATASET
Spam Dataset
Sample volume: 200,000
Sample size: 1.75 GB
Received in July 2016
Format: EML
Field categories:
- From address
- Subject
- Date
- Body (MIME)
Read Spam Dataset
Python standard library: email
- https://docs.python.org/3/library/email-examples.html
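
A minimal sketch of pulling the four field categories out of one message with the standard `email` package. It parses from a string for brevity; for files on disk, `email.message_from_binary_file` with the same policy is the usual counterpart. The function name `read_eml_text` is hypothetical, and real spam may contain malformed headers or unknown charsets that need extra error handling.

```python
from email import message_from_string, policy


def read_eml_text(raw):
    """Extract From, Subject, Date and body from raw EML text (illustrative)."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the HTML part, then plain text; returns None if neither exists
    body = msg.get_body(preferencelist=("html", "plain"))
    return {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": str(msg.get("Date", "")),
        "body": body.get_content() if body is not None else "",
    }


raw = (
    'From: "Alisa Sharpe" <Sharpe_Alisa@chamblee.default.com>\n'
    "Subject: Re: Enjoy envious stares when you wear our watches\n"
    "Date: Tue, 12 Jul 2016 06:49:09 +0600\n"
    "Mime-Version: 1.0\n"
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Solve your dilemma now<br>\n"
)
fields = read_eml_text(raw)
```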
Spam Mail Example
Message-ID: <3210276217-URSBFSAWVWJITSNSTQBQAGZC@fauudpop.chamblee.default.com>
From: "Alisa Sharpe" <Sharpe_Alisa@chamblee.default.com>
Subject: Re: Enjoy envious stares when you wear our watches
To: <removed>
Date: Tue, 12 Jul 2016 06:49:09 +0600
Mime-Version: 1.0
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit
Like a certain brand of watches, but never wanted to pay the price? Solve your dilemma now<br>
<a href="hxxp://889457.finewatch2016.ru#FOlmBCUdEp8EJhjsUpA9GmlqFV4g"style="color:#0B7303;">HOT OFFER!</a>
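
The link in the example is written with `hxxp`, a defanged spelling of `http`. The sketch below collects the host names linked from a message body, accepting both spellings; the regular expression and the function name `extract_hosts` are assumptions for illustration and will not cover every URL form found in spam.

```python
import re
from urllib.parse import urlsplit

# Matches http(s) and the defanged hxxp(s) spelling
URL_RE = re.compile(r"h(?:tt|xx)ps?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(body):
    """Return the host names of all links found in a message body."""
    hosts = []
    for url in URL_RE.findall(body):
        # Restore the scheme so the URL parser recognizes it
        refanged = re.sub(r"^hxxp", "http", url, flags=re.IGNORECASE)
        host = urlsplit(refanged).hostname
        if host:
            hosts.append(host)
    return hosts


body = '<a href="hxxp://889457.finewatch2016.ru#FOlm"style="color:#0B7303;">HOT OFFER!</a>'
hosts = extract_hosts(body)
```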
Spam Dataset Application
Emergent spam topic
Spam identification
NETWORK IPS DATASET
Network IPS Dataset
Attack behavior logs from home routers
Sample volume: 1,000,000
Sample size: 250 MB
Format: CSV
Field categories:
- Device info
- Event info
- Router IP (obfuscated)
Device Info Fields
device_dev_name
- Apple iPad Mini, Google Nexus 5, Sony PlayStation 4, Synology NAS, etc.
device_os_name
- Apple iOS, Android, Linux, Wii, etc.
device_type_name
- Desktop/Laptop, NAS, DVR, IP Camera, etc.
device_vendor_name
device_hashed_mac
Event Info Fields
event_protocol_id
- Assigned Internet Protocol Numbers by IANA
- 1: ICMP, 6: TCP, 17: UDP, etc.
- https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
event_self_ipv4
- Usually a private IP or an obfuscated public IP
event_time
event_flow_outbound_or_inbound
event_role_device_or_router
event_role_server_or_client
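
Two of these fields usually need decoding before analysis. The sketch below maps `event_protocol_id` to a protocol name and parses `event_time`, assuming the month/day/year 12-hour form seen in the dataset's sample rows (for example `12/28/2016 2:21:04 AM`). The helper names and the "other" fallback label are assumptions; only the three protocol numbers listed above are mapped.

```python
from datetime import datetime

# Only the protocol numbers named on the slide
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def protocol_name(protocol_id):
    """Map an IANA protocol number to a short name (partial table)."""
    return PROTOCOL_NAMES.get(int(protocol_id), "other")


def parse_event_time(text):
    """Parse an event_time string such as '12/28/2016 2:21:04 AM'."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


when = parse_event_time("12/23/2016 12:23:17 AM")
proto = protocol_name("17")
```

Note that `12:23:17 AM` is 23 minutes past midnight, so the parsed hour is 0.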
Event Rule Fields
event_rule_category
- Access Control, Web Attack, Buffer Overflow, DoS/DDoS, BotNet, etc.
event_rule_name
- EXPLOIT Bitcoin/LiteCoin/Dogecoin Mining Activity -1
- WEB Cross-site Scripting (document.cookie) attempt
- SHELLCODE NOP Sled, etc.
event_rule_reference
- CVE-2005-0211, CVE-2011-2133, CVE-2014-4116, etc.
event_rule_severity
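
A single `event_rule_reference` value can carry several identifiers. In the dataset's sample rows they are joined with semicolons and can include non-CVE advisories such as `TA13-088A`. A minimal sketch of splitting them, with a hypothetical function name:

```python
def split_references(field):
    """Split a reference field into CVE identifiers and other references."""
    refs = [r.strip() for r in field.split(";") if r.strip()]
    cves = [r for r in refs if r.upper().startswith("CVE-")]
    others = [r for r in refs if not r.upper().startswith("CVE-")]
    return cves, others


cves, others = split_references("TA13-088A;CVE-2013-unknown")
```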
Network IPS Example
Apple iPhone 6 Plus,,c17bdadda83e4200d7ed41b7e6cf5b43c62e725c,Apple iOS,Smartphone,"Apple Inc.",6,outbound,device,client,Web Attack,1055396,WEB Cross-site Scripting -9,CVE-2011-2260;CVE-2011-2710;CVE-2012-0017;CVE-2012-0551;CVE-2012-0719;CVE-2012-1859;CVE-2012-4939;CVE-2013-5013;CVE-2014-2092;CVE-2013-7051;CVE-2014-1754;CVE-2014-6325;CVE-2014-6535;CVE-2014-2856;CVE-2014-5360;CVE-2016-0712;CVE-2016-3212;CVE-2016-6837,5,192.168.1.238,12/28/2016 2:21:04 AM,165.170.147.184
Synology NAS,,3a85f6a9e776fb803e08ed991fe348b984001bfd,Linux,NAS,Synology Inc.,17,outbound,device,client,DoS/DDoS,1130172,DNS DNS Amplification Attacks -1,TA13-088A;CVE-2013-unknown,4,192.168.0.5,12/16/2016 1:39:06 AM,166.29.195.94
Sony PlayStation 3,Game Console,185b7ce02ec4d07df4b48a8d6f94fdcde8b36492,XMB,Game Console,Sony Corporation,6,outbound,device,client,Web Attack,1130054,WEB Directory Traversal -5.a,CVE-2014-1619,4,192.168.1.133,12/23/2016 12:23:17 AM,163.51.28.15
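
Each record has 18 comma-separated values, and quoted values such as `"Apple Inc."` call for a real CSV parser instead of a plain split. The sketch below uses the standard `csv` module. The column order is inferred from the sample rows, and the names `device_family_name`, `event_rule_id`, and `router_ip` are hypothetical because the slides do not name those columns. If the actual file ships with a header row, `csv.DictReader` would be the simpler choice.

```python
import csv
import io

# Assumption: order inferred from the sample rows; three names are hypothetical
COLUMNS = [
    "device_dev_name", "device_family_name", "device_hashed_mac",
    "device_os_name", "device_type_name", "device_vendor_name",
    "event_protocol_id", "event_flow_outbound_or_inbound",
    "event_role_device_or_router", "event_role_server_or_client",
    "event_rule_category", "event_rule_id", "event_rule_name",
    "event_rule_reference", "event_rule_severity", "event_self_ipv4",
    "event_time", "router_ip",
]


def read_events(text):
    """Parse CSV text into dicts, skipping rows with an unexpected width."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]


sample_row = (
    "Sony PlayStation 3,Game Console,185b7ce02ec4d07df4b48a8d6f94fdcde8b36492,"
    "XMB,Game Console,Sony Corporation,6,outbound,device,client,Web Attack,"
    "1130054,WEB Directory Traversal -5.a,CVE-2014-1619,4,192.168.1.133,"
    "12/23/2016 12:23:17 AM,163.51.28.15\n"
)
events = read_events(sample_row)
```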
Network IPS Dataset Application
Discover network attack patterns
Anomalous behavior detection
T-BRAIN
T-Brain
- https://tbrain.nchc.org.tw/
- Dataset
- Brain
  - xgboost, Keras-theano, Keras-tensorflow
  - Pandas, sklearn
- Community
Login
Dataset on T-Brain
Download Dataset
Password: TBrain
New script for dataset
Jupyter Notebook
Hide the script
Sample scripts
Sample script
T-Brain
THANK YOU

【HITCON Hackathon 2017】 TrendMicro Datasets