Network-behavior-based RAT detection has been studied by
several network security researchers. Li [7] extracts 6 network
behavior features and uses clustering algorithms to detect
Trojans from network traffic. However, normal P2P services are
found to have a great influence on the performance of this
detection method. Dan [8] mainly focuses on the first few
packets of a TCP session and applies classification algorithms.
However, the chosen network behavior features need to be
optimized, since only 2 of the 7 features extracted by Dan are
uncorrelated. Besides, the number of packets in the early stage
is too small to provide sufficient features in a real network
environment. In [5], 927 cross-layer network features are
analysed by 3 different classification algorithms to detect
malware sessions. This solid work targets dozens of types of
malware, making it too complicated for RAT detection.
III. OVERVIEW OF RAT COMMUNICATION
A typical communication mechanism of a reverse RAT is
described in this part. A complete RAT session starts with a
successful TCP handshake and ends with a TCP FIN or RST
process. After the TCP connection has been established, the RAT
goes through a period called the early stage, during which the
interval between two adjacent packets is less than the threshold
of t second(s) [8]. The early stage is when the server part and
the client part of the RAT exchange some basic information.
After that, some small packets carrying the hacker's commands
are sent to the client side. In response, the client sends several
packets back, and these packets are relatively larger than the
command packets because they may carry confidential data.
During the idle time when the hacker gives no instruction,
keep-alive heartbeat packets may be exchanged between client
and server to tell each other that they are still connected.
[Fig. 1 labels: payload length (axis); RAT server side (hacker)]
However, not every RAT session behaves exactly like the
typical communication pattern described above. A few RATs may
not send heartbeat packets, or may have more inbound bytes than
outbound bytes. Some normal applications may also act like a
RAT in some ways. For example, P2P and cloud sessions also tend
to use the PSH flag, and remote desktop services may send
heartbeat packets. Thus, none of those features alone can
distinguish a RAT session from a legitimate session, and finding
a way to integrate those features is the key to solving the RAT
detection problem.
IV. REVERSE RAT DETECTION METHOD
Our RAT detection method is described in this section. We
first pick out four network behavior attributes that best
represent the differences between RAT and normal sessions.
Then we apply different machine learning models to learn the
mapping between the four attributes and whether a session is a
RAT session or not. A labeled data set consisting of real RAT
and normal network traffic is used to train and test our models.
A. Feature Selection and Extraction
We extract four features from every complete TCP session:
• out-in-bytes-ratio: the ratio between the average outbound
bytes and inbound bytes. It is a positive continuous value.
• PSH-flag-ratio: the ratio between the number of packets
with the PSH flag and the number of packets in the
session. It is a positive continuous value.
• early-stage-packet-number: the number of packets in the
session's early stage with threshold t set to 1 second.
It is a positive integer.
• heartbeat-flag: the flag indicating whether a session has
heartbeat packets. It is a Boolean with a value of 0 or 1.
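The two ratio features can be illustrated with a minimal Python sketch. The `Packet` record, the `ratio_features` helper and the sample addresses are hypothetical names introduced only for this illustration. The sketch assumes that the direction of a packet is decided by comparing its source IP with the client IP, that the ratio is taken over total payload bytes per direction, and that a session without inbound payload gets an infinite ratio; none of these choices is stated in the text.

```python
from dataclasses import dataclass

PSH = 0x08  # bit of the PSH flag in the TCP flags field


@dataclass
class Packet:
    """Hypothetical per-packet record holding the fields of Table I."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    timestamp: float  # seconds
    flags: int        # TCP flag bits
    payload_len: int  # bytes


def ratio_features(packets, client_ip):
    """Return (out_in_bytes_ratio, psh_flag_ratio) for one TCP session."""
    # Assumption: packets sent by the client count as outbound.
    outbound = sum(p.payload_len for p in packets if p.src_ip == client_ip)
    inbound = sum(p.payload_len for p in packets if p.src_ip != client_ip)
    psh_packets = sum(1 for p in packets if p.flags & PSH)
    # Assumption: no inbound payload is mapped to an infinite ratio.
    out_in = outbound / inbound if inbound else float("inf")
    psh_ratio = psh_packets / len(packets) if packets else 0.0
    return out_in, psh_ratio


# Made-up session: client 10.0.0.5 talks to server 192.0.2.1.
session = [
    Packet("10.0.0.5", 50000, "192.0.2.1", 443, 0.0, 0x18, 300),
    Packet("192.0.2.1", 443, "10.0.0.5", 50000, 0.1, 0x18, 100),
    Packet("10.0.0.5", 50000, "192.0.2.1", 443, 0.2, 0x10, 0),
    Packet("192.0.2.1", 443, "10.0.0.5", 50000, 0.3, 0x10, 0),
]
out_in_ratio, psh_ratio = ratio_features(session, "10.0.0.5")
print(out_in_ratio, psh_ratio)
```

In this made-up session the client sends 300 payload bytes and receives 100, and two of the four packets carry the PSH flag.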
To obtain the above four features, the basic information that
we need to collect from every TCP packet is listed in Table I.
To get the early-stage-packet-number, we calculate the
time interval between every two adjacent packets until the time
interval is greater than the threshold t (1 second here). A
RAT's heartbeat starts with a fixed-length packet sent by the
client and a fixed answering packet sent by the server, and this
pattern repeats several times. In this paper we set the number
of repetitions to 3. The pseudo code of the heartbeat packet
detection algorithm is shown in Fig. 2. The overall network
behavior feature extraction process is shown in Fig. 3.
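The two remaining features can be sketched in Python under stated assumptions. This is an illustrative reading of the description above, not the pseudo code of Fig. 2: the function names are hypothetical, the input is assumed to start after the TCP handshake, and the heartbeat check assumes that the list contains only payload-carrying packets and that the three client/server pairs are strictly consecutive.

```python
def early_stage_packet_number(timestamps, t=1.0):
    """Count packets until the first gap between neighbours exceeds t seconds."""
    if not timestamps:
        return 0
    count = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > t:
            break
        count += 1
    return count


def has_heartbeat(packets, repeat=3):
    """Return 1 if a fixed (client, server) payload pair repeats `repeat` times.

    packets: time-ordered list of (is_outbound, payload_len) tuples.
    Assumption: packets without payload were filtered out beforehand.
    """
    window = 2 * repeat
    for i in range(len(packets) - window + 1):
        chunk = packets[i:i + window]
        out_len, in_len = chunk[0][1], chunk[1][1]
        if out_len == 0 or in_len == 0:
            continue
        # Expected pattern: client packet of fixed length, fixed server answer.
        expected = [(True, out_len), (False, in_len)] * repeat
        if chunk == expected:
            return 1
    return 0


early = early_stage_packet_number([0.0, 0.2, 0.5, 2.0, 2.1])
beating = has_heartbeat([(True, 120), (False, 800)] + [(True, 16), (False, 16)] * 3)
silent = has_heartbeat([(True, 16), (False, 16), (True, 20),
                        (False, 16), (True, 16), (False, 16)])
print(early, beating, silent)
```

In the first call the gap of 1.5 seconds after the third packet ends the early stage. A real capture would likely need extra handling for retransmissions and pure ACK packets, which this sketch leaves out.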
Packet info   Meaning
Src_ip        Source IP address
Src_port      Source TCP port
Dst_ip        Destination IP address
Dst_port      Destination TCP port
Timestamp     Packet timestamp
Flags         TCP flags including SYN/ACK/FIN/RST/PSH
Payload_len   Payload length in bytes
[Fig. 1 labels: time (axis); TCP SYN handshake, early stage, data transmission, heartbeat packets; RAT client side (normal user)]
Fig. 1. Typical RAT session communication process
There are some network behavior differences between
legitimate and RAT sessions. Legitimate sessions tend to
transport data as soon and as much as possible once the TCP
handshake has finished, so most legitimate sessions don't have
any early stage. Most RAT sessions transport more outbound data
than inbound data in order to send out confidential information,
while normal sessions often behave in the opposite way. The PSH
flag in the TCP header informs the receiving host that the data
should be pushed up to the receiving application immediately.
The rate of packets containing the PSH flag in RAT sessions is
very likely to be higher than that in legitimate sessions,
because hackers want their packets to be handled with higher
priority. Moreover, normal sessions usually don't have
heartbeat packets.
TABLE I. PACKET INFORMATION TO BE EXTRACTED
\{x_1, x_2, x_3, x_4\} \rightarrow y, \quad y \in \{0, 1\} \qquad (1)
TABLE II. STATISTICAL FEATURES OF THE COLLECTED DATA SET
Based on previous studies [2][4-5] related to RAT detection,
six different machine learning algorithms are chosen for this
classification problem: kNN (k Nearest Neighbor), Naive Bayes,
Logistic Regression, SVM (Support Vector Machine), Random
Forest and decision tree. All algorithms are implemented with
the Python scikit-learn library.
We use 10-fold cross validation to split the data for
training and testing in all tests, and we use the accuracy and
Area Under the Curve (AUC) metrics to evaluate the performance
of our algorithms:
• Accuracy: Accuracy is a basic score in classifier
evaluation, which is calculated by equation (2):
C. Machine Learning Model Training and Evaluation
So far, the reverse RAT detection problem has become a
binary classification problem, taking the above four features
{x1, x2, x3, x4} as input and y ∈ {0, 1} (1 for a RAT session
and 0 for a normal session) as output, as in equation (1).
Feature                                      Reverse RAT sessions   Legitimate sessions
Average early stage packet number            475                    845
Average outbound TCP payload length (byte)   78.74                  146.8
Average inbound TCP payload length (byte)    94.24                  866.32
Percentage of sessions having heartbeat      24.3%                  0.91%
[Algorithm 1: Heartbeat Detection Module. Input: packet information list. Output: heartbeat flag.]
Fig. 2. Heartbeat packet detection algorithm
[Flowchart steps: calculate the packet interval time; set early_stage_end_flag = True; PSH_packet_number = PSH_packet_number + PSH_flag_value; set current packet to be the next packet in the list; is this the last packet of the session?; calculate out-in-bytes-ratio = outbound_byte / inbound_byte; calculate PSH-flag-ratio = PSH_packet_number / session_packet_number]
Fig. 3. Network behavior feature extraction process
B. Data set
To train supervised machine learning algorithms, we
collect 370 real RAT traffic samples from open source
communities¹, covering the RAT types Gh0st, Remcos, Nanocore,
Adwind and NetSupport Manager from 2016 to 2018. Around 30% of
the collected RAT traffic is encrypted. As for normal
sessions, we collect traffic information of 2190 sessions from
our company network, covering the application types E-mail, QQ,
web browsing, P2P and cloud service.
The statistical features in Table II are calculated from the
collected data set, and they show some differences between
reverse RAT and legitimate sessions. The average early stage
packet number of legitimate sessions is much greater than that
of RAT sessions. The gap between outbound and inbound TCP
payload length is much bigger for legitimate sessions than for
RAT sessions. And the proportion of sessions containing
heartbeat packets is higher among RAT sessions than among
normal ones.
\text{Accuracy} = \frac{\text{Correctly Classified Sample Number}}{\text{Total Sample Number}} \qquad (2)
• AUC: The RAT detection task is obviously imbalanced,
since the number of RAT sessions is much smaller than
that of normal sessions in practice. AUC is known to be
a reliable measure for imbalanced data sets [4]. The
Receiver Operating Characteristic (ROC) curve is created
by plotting the true positive rate (TPR) against the
false positive rate (FPR) at various threshold settings
in binary classification. AUC is the area under the ROC
curve.
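Both metrics can be worked through on a toy example. The following Python sketch is only an illustration with made-up labels and scores. It computes accuracy as in equation (2) and computes AUC through the equivalent rank formulation, that is, the probability that a randomly chosen RAT session receives a higher score than a randomly chosen normal session, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Correctly classified sample number divided by total sample number."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up imbalanced sample: 8 normal sessions (0) and 2 RAT sessions (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.7, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc_model = accuracy(y_true, y_pred)
acc_all_normal = accuracy(y_true, [0] * len(y_true))
auc_model = auc(y_true, scores)
print(acc_model, acc_all_normal, auc_model)
```

In this toy sample a classifier that labels every session as normal reaches the same accuracy as the scored model while detecting no RAT session at all, which is why accuracy alone is not sufficient for an imbalanced task.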
As seen in Figure 4, all six algorithms achieve an accuracy
higher than 0.92 and an AUC of at least 0.87, which verifies
that machine learning is a feasible solution for reverse RAT
detection. Random Forest with 10 trees, SVM with a linear
kernel and Logistic Regression have an average AUC of 0.954,
which is higher than that of 4NN, Gaussian Naive Bayes and
decision tree. This indicates that the first 3 algorithms are
more capable of handling imbalanced data sets. Meanwhile, the
decision tree algorithm may not be well suited to imbalanced
tasks, since it gets the lowest AUC of 0.87. Random Forest gets
an accuracy of 0.957 and an AUC of 0.979, making it the optimal
solution for reverse RAT detection among all six algorithms.
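The evaluation procedure can be sketched with scikit-learn as follows. This is a minimal sketch and not the code used for the experiments: the feature matrix is synthetic and every distribution parameter in `synth` is invented for illustration, so the printed scores are unrelated to the results reported above. The use of `StratifiedKFold` with shuffling, the fixed random seeds and `max_iter=1000` are likewise assumptions; the text only states that 10-fold cross validation is used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def synth(n, ratio_mean, psh_mean, early_mean, hb_prob):
    """Draw n synthetic sessions; all parameters are made up for the demo."""
    return np.column_stack([
        rng.lognormal(np.log(ratio_mean), 0.5, n),      # out-in-bytes-ratio
        np.clip(rng.normal(psh_mean, 0.15, n), 0, 1),   # PSH-flag-ratio
        rng.poisson(early_mean, n),                     # early-stage-packet-number
        rng.random(n) < hb_prob,                        # heartbeat-flag
    ]).astype(float)


n_rat, n_normal = 60, 340  # imbalanced on purpose
X = np.vstack([synth(n_rat, 2.0, 0.7, 8, 0.25),
               synth(n_normal, 0.3, 0.4, 30, 0.01)])
y = np.array([1] * n_rat + [0] * n_normal)  # 1 = RAT, 0 = normal

models = {
    "4NN": KNeighborsClassifier(n_neighbors=4),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear)": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Assumption: stratified folds keep some RAT sessions in every test split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=("accuracy", "roc_auc"))
    results[name] = (scores["test_accuracy"].mean(),
                     scores["test_roc_auc"].mean())
    print(f"{name}: accuracy={results[name][0]:.3f}, AUC={results[name][1]:.3f}")
```

Replacing `X` and `y` with the four extracted features and the session labels would apply the same loop to real traffic.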
¹ www.malware-traffic-analysis.net, www.contagiodump.blogspot.com,
www.capture.blogspot.com
[Chart title: Accuracy and AUC Value; series: accuracy, AUC]
Fig. 4. Accuracy and AUC value of test algorithms
V. CONCLUSIONS
In this paper, we introduce a reverse RAT detection method
based on network behavior features and machine learning
algorithms. Instead of inspecting the payload of network
traffic, our approach uses only 4 features extracted from TCP
headers, making it efficient enough to detect RATs in real time.
The proposed method is mainly based on the fact that reverse RAT
sessions are more likely than normal sessions to have short
early stages, heartbeat packets and PSH flags, and to send out
more data. Machine learning is able to solve this binary
classification problem according to our test on real data.
Random Forest performs the best with an accuracy of 0.957 and an
AUC of 0.979, followed by SVM and Logistic Regression. Thus, our
approach can detect unencrypted and encrypted RAT sessions
accurately and efficiently.
REFERENCES
[1] Michael, A., Sean, M., Christopher, C., & Aaron, L. (2016) Hacking
Exposed Malware & Rootkits: Security Secrets and Solutions, 2nd edn,
McGraw-Hill Education, New York.
[2] Li, W., Liu, H., & Zhang, X. (2016) A network data security analysis
method based on DPI technology. 2016 7th IEEE International
Conference on Software Engineering and Service Science (ICSESS), 97-
976.
[3] Zhu, H., Tian, Z., & Xue, H. Practice of Automatic Monitoring Tool for
Boundary Port of Electric Power Information Network. Hunan Electric
Power, 2017, (37):49-52.
[4] Reham, T., Nada, M., & Ayman, M. (2017) A survey on deep packet
inspection. Proceedings of ICCES 2017 12th International Conference
on Computer Engineering and Systems, 188-197.
[5] Dmitri, B., Bracha, S., Lior, R., & Ariel, B. (2015) Unknown Malware
Detection Using Network Traffic Classification. 2015 IEEE Conference
on Communications and Network Security, 134-142.
[6] Elias, R. & Xenofontas, D. (2014) IDS Alert Correlation in the Wild
With EDGe. IEEE Journal on Selected Areas in Communications, 1933-
1946.
[7] Shicong, L., Xiaochun, Y., Yongzheng, Z., Yi, P., & Tao, Y. (2012) A
Novel Approach of Detecting Trojan Based on Network Behavior
Analysis. 2012 IEEE 14th International Conference on Communication
Technology, 513-518.
[8] Dan, J., & Kazumasa, O. (2015) An Approach to Detect Remote Access
Trojan in the Early Stage of Communication. 2015 IEEE 29th
International Conference on Advanced Information Networking and
Applications, 706-713.