Seminars in Advanced Topics in Engineering in
Computer Science - Final Project:
A taxonomy of botnet detection approaches
Fabrizio Farinacci <farinacci.1530961@studenti.uniroma1.it>
September 3, 2017
Abstract
Botnets gained wide visibility in the last few years, becoming one of the most observed
and dangerous threats in the malware landscape. This is mainly due to the fact that
botnets represent a very valuable and flexible asset in the arsenal of hackers and APTs.
We have seen botnets become real cyber-weapons, capable of targeting nations and the
business of famous companies. For this reason, it is crucial to design techniques able to
detect bot-infected hosts at different levels (enterprise, ISP, etc.). Different kinds of
techniques have been studied and researched over the last years, in a never-ending race
between attackers and defenders. In this work a representative set of the literature is
analyzed, highlighting the different kinds of state-of-the-art approaches that researchers
have followed with the ultimate goal of designing effective botnet detection solutions.
The objective is to produce a taxonomy of botnet detection techniques, showing possible
research directions for designing new techniques to mitigate the risks associated with
botnet-based attacks.
1 Botnets: A brief introduction
When talking about botnets, we refer to a set of malware-compromised machines, controlled
by one or more botmasters through commands submitted by means of specific channels,
without the users of those machines being aware of it. This implies that the bot machines
become part of a large army of zombie hosts, devoted to performing malicious activities
such as DDoS attacks and spam campaigns, as established by the botmasters. Botnets can be
distinguished on the basis of their structural, technological and behavioral characteristics,
as described in the following subsections.
1.1 Architecture
The architecture is the topology upon which the Command and Control (C2C) infrastructure
used to govern the bots is built. C2C architectures evolved over time to reduce the possibility
of enumeration and discovery of infected machines and to increase the botnet's resilience to shutdown.
1.1.1 Client-server
The client-server model was used in the first type of botnets that appeared in the wild.
It was usually built on Internet Relay Chat (IRC), using IRC servers to send the commands
controlling the infected hosts, or using domains and websites containing the list of all the
commands for the botnet to be controlled. In both cases, infected hosts needed to connect
to the IRC server or to the C2C domain to obtain the commands and perform their
malicious tasks. The main drawback, leading to the progressive disappearance of the model,
is the fact that servers and domains are single points of failure: most such botnets have
been taken down in a matter of time, and the use of techniques like Dynamic DNS and
Fast-Fluxing only managed to slow down the ultimate takedown. This is why hackers
moved to P2P solutions to increase botnet resilience and avoid takedown.
1.1.2 Peer-to-peer
Peer-to-peer (P2P) architectures are characterized by their topology flexibility and
unpredictability, which makes them more difficult to enumerate and discover, and consequently
more difficult to take down completely. Newer botnets tend to be more and more based
on peer-to-peer, to reduce the risk of being shut down. C2C is embedded directly into the
botnet hosts, rather than relying on an external server that may become a single point of
failure in the botnet functioning. Also, it is very common to use public key cryptography to
secure the data relayed in the peer-to-peer network and to identify commander hosts, which
are also part of the peer-to-peer network along with their zombie counterparts. Each bot
knows only a list of peers to which the commands are sent and then relayed to other
peers deeper in the P2P network; this list usually includes around 256 peers, which keeps
the list small enough to be passed to other peers, fighting against botnet takedown
while allowing online bots to stay in contact. Even though peer-to-peer based botnets
are much harder to disrupt, they are not invulnerable to attacks or disruption. Two
common techniques to face P2P botnets are crawling and sinkholing. With crawling it is
possible to enumerate all or most of the bots that are part of the network; once the bots have
been enumerated, sinkholing can be used to achieve disruption. It relies on the typical peer-list
flooding technique used by P2P botnets to achieve full coverage and works by injecting
into the peer-lists of all the bots of the network fake nodes that may be either controlled
by defenders or nonexistent, making the bots point to a "black hole" and modifying the
structure of the network by turning it into a centralized system, which can be easily taken down.
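To make the sinkholing idea concrete, the following is a minimal, self-contained Python sketch of peer-list poisoning: given an intercepted peer list, it substitutes most real entries with defender-controlled sinkhole nodes before the list is forwarded. The peer-list representation, the 256-entry cap and the specific addresses are illustrative assumptions, not the format of any real botnet protocol.

# Minimal sketch of peer-list poisoning for sinkholing (illustrative only).
# The peer-list format and the 256-entry limit are assumptions taken from the
# description above, not from any specific botnet's real protocol.

import random

SINKHOLE_PEERS = [("192.0.2.10", 16464), ("192.0.2.11", 16464)]  # defender-controlled nodes

def poison_peer_list(peer_list, max_size=256):
    """Replace most real peers with sinkhole entries, so forwarded lists
    progressively point the botnet toward defender-controlled nodes."""
    poisoned = list(SINKHOLE_PEERS)
    # keep a few real-looking entries so the list still passes basic sanity checks
    poisoned += random.sample(peer_list, k=min(4, len(peer_list)))
    return poisoned[:max_size]

if __name__ == "__main__":
    original = [("198.51.100.%d" % i, 16464) for i in range(1, 40)]
    print(poison_peer_list(original))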
1.2 General communication techniques
Another distinguishing feature of the C2C is the communication technique used to receive
commands and to send data. The communication channel plays an important role in
the malware's persistence within the system: this is why newer malware often uses very
"creative" solutions to pass unnoticed, exploiting covert channels to avoid detection.
1.2.1 Domains (HTTP based solutions)
Using domains as C2C servers was one of the first solutions adopted by botnets: well-crafted
domains or websites contain all the commands for the zombie hosts, which only have to
connect to them to retrieve those commands through simple HTTP requests. The main advantage
of this solution is that even big botnets can be easily maintained and updated by simply
updating the content of the domain or website. The biggest disadvantage is that even
fault-tolerant solutions with replicated servers can be quickly taken down by governments or may
also be easily targeted by denial-of-service attacks. Another disadvantage is the bandwidth
consumption for the domain, which is high compared to other solutions. This is why pure
domain-based solutions are no longer used by malware developers. Instead, some phases
of the botnet installation and initial setup may still be based on domain servers,
typically using dynamic DNS to change IP frequently and avoid being shut down.
1.2.2 IRC
Another widely adopted solution was using the IRC (Internet Relay Chat) protocol and
IRC servers to serve as C2C servers. Infected clients connect to an IRC server to join
an IRC channel, created by the botmaster and dedicated to C2C traffic. The botmaster
can then simply send IRC messages to the channel to reach all its members through
message broadcast. The main advantage of IRC-based solutions is low bandwidth
consumption, thanks to the lightweight IRC protocol used as communication protocol.
The disadvantages are forced simplicity and low shutdown resilience, as in the domain case.
It has also been shown that keyword blacklisting is effective in blocking IRC-based botnets.
For these reasons, pure IRC-based solutions are no longer adopted. Instead, the IRC protocol,
for its low bandwidth consumption and proven simplicity, may still be used to carry the botnet's
communication in combination with other solutions such as Tor and .onion domains.
1.2.3 P2P protocols
Along with the spread of P2P-based architecture solutions, botnets also started using
existing P2P overlay protocols as the communication channel for their C2C communications.
A common example is the Kad network (based on UDP), a peer-to-peer network
implementing the Kademlia P2P overlay protocol. The first P2P file-sharing programs
relied on such a network, using client programs supporting the Kad network implementation.
Since these were very popular at the time, malware developers started to use the Kad
network as a C2C covert channel. This is the case of Alureon1 (aka TDSS), which according
to Microsoft was one of the most active botnets in the second quarter of 20102, and which
included encrypted communications and a decentralized C2C relying on the Kad network.
1 https://en.wikipedia.org/wiki/Alureon
2 http://download.microsoft.com/download/8/1/B/81B3A25C-95A1-4BCD-88A4-2D3D0406CDEF/Microsoft_Security_Intelligence_Report_volume_9_Battling_Botnets_English.pdf
1.2.4 DNS
DNS has been used (and abused) by malware developers due to its potential: it allows
the creation and registration of a set of static or dynamically generated domains (by using
Domain Generation Algorithms), while continuously changing the IPs the domains resolve to,
avoiding IP blacklisting and dramatically increasing the takedown effort for governments. DNS
is also very valuable for another reason: since DNS queries and responses are rarely
blocked by firewalls, it offers a very good way to transmit and receive information
while avoiding detection. DNS covert channels started to be increasingly used by malware to
transmit payload data, tunnel other application protocols and secure the DNS payload
with encryption, requiring very little investment and no complex infrastructure to work.
DNS requests/responses get formatted according to the DNS syntax, typically carrying a
text-formatted resource record payload that may support a chunking mechanism to work
around the DNS resource record size limit (255 bytes). This mechanism has been exploited by
Feederbot [6], one of the first malware families using DNS covert channels.
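As an illustration of the chunking mechanism described above, the sketch below encodes an arbitrary payload into 255-byte TXT-style character strings, each prefixed with a sequence number so the receiver can reassemble them. The encoding, prefix format and chunk bookkeeping are assumptions made for this example, and no actual DNS traffic is generated.

# Illustrative sketch of the chunking idea behind DNS covert channels
# (encoder/decoder only; no real DNS traffic is generated). The chunk size
# and sequence-number scheme are assumptions for illustration.

import base64

MAX_TXT_CHUNK = 255  # single character-string limit in a TXT record

def to_txt_chunks(payload: bytes):
    """Encode a payload and split it into TXT-sized chunks, each tagged
    with a sequence number so the receiver can reassemble them in order."""
    encoded = base64.b32encode(payload).decode("ascii")
    body_size = MAX_TXT_CHUNK - 8          # leave room for the "NNNN:" prefix
    chunks = [encoded[i:i + body_size] for i in range(0, len(encoded), body_size)]
    return [f"{seq:04d}:{chunk}" for seq, chunk in enumerate(chunks)]

def from_txt_chunks(chunks):
    ordered = sorted(chunks, key=lambda c: int(c.split(":", 1)[0]))
    return base64.b32decode("".join(c.split(":", 1)[1] for c in ordered))

if __name__ == "__main__":
    data = b"command: update config" * 20
    assert from_txt_chunks(to_txt_chunks(data)) == data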
1.2.5 Tor
Even if centralized solutions suffered from easy identification and consequent shutdown,
in terms of simplicity and manageability they were the best from the botmaster's perspective.
This is why, during the last years [2], a new trend started to appear in the wild: Tor-based
botnets. Tor is an anonymous communication network based on the onion routing protocol,
in which information is sent in encrypted form through a virtual circuit of randomly chosen
relay nodes (typically 3) belonging to the Tor network, by negotiating symmetric encryption
keys with them. The real advantage for malware developers is another feature: Hidden
Services (HSs), which are web services published anonymously in the Tor network and
reachable without the need to know their location. The only thing that must be known to
get in touch with them is their descriptor, in the form of a .onion address. Malware
developers started to exploit HSs as more robust and resilient domains to control
their bots, which just need to include a Tor client to connect to the HSs. Furthermore,
the HS can be located on the infected machine side, creating distributed solutions even more
resilient to shutdown. This solution has been adopted by Skynet [15], the first botnet
found in the wild exploiting Tor and Hidden Services as a communication channel.
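For context, reaching a hidden service programmatically only requires routing requests through a local Tor client's SOCKS proxy; the short sketch below shows this from an analyst's perspective (e.g. probing a suspected C&C hidden service). It assumes Tor is listening on 127.0.0.1:9050 and that the requests library is installed with SOCKS support; the .onion address is a placeholder.

# Minimal sketch of reaching a hidden service through a local Tor client.
# Assumes Tor's SOCKS proxy on 127.0.0.1:9050 and requests[socks] installed;
# the .onion address below is a placeholder, not a real service.

import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: let Tor resolve .onion names
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_hidden_service(onion_url: str) -> str:
    response = requests.get(onion_url, proxies=TOR_PROXIES, timeout=60)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(fetch_hidden_service("http://exampleonionaddressxyz.onion/"))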
1.2.6 Social networks
Very recently a promising and threatening solution started to be exploited in the
wild: social network C&C. Malware developers became interested in social media
as a command and control channel for many reasons: the chat-like capabilities (reminiscent
of the old IRC days), the fact that access to social media through HTTP and HTTPS
connections is rarely blocked by firewalls and is also hardly identified as malicious by network
monitoring software, the possibility of using the well-tested and powerful APIs of such sites
and, most of all, the extreme ease of creating fake botmaster accounts (or multiple accounts).
For these reasons, the use of social networks as a C2C channel is a topic of high
interest at the moment and many studies on the real potential of this solution are still
under evaluation. A well-known example is Stegobot [28], a very stealthy botnet exploiting
social networks and steganography to encode commands into images that are then posted
on the social network for other bot members to see.
1.3 Goals
Typically, malware developers have two generic goals in mind: monetization and
adversary defeat. This is also the reason behind the evolution of the generic malware
goals: some goals used to be very profitable/effective but are not anymore, which is
why they are less frequently observed in the wild. It is very typical for a malware to have
multiple goals, to increase the revenue and improve reusability; this comes at the price of
stealthiness: the more the activities, the easier it is for the botnet to be detected. Well-known
examples of goals and activities performed by botnets are:
• E-mail spam, consisting in sending spam e-mail messages with different objectives
(advertisements, phishing, etc.); it used to be very profitable (until the late 2000s)
but is now a rarely observed activity.
• Credential sniffing and data exfiltration, to steal credentials and sensitive/secret
data from the victim machines, gathered by the botmaster for later reuse or for selling
on black markets.
• Denial-of-service attacks, which is probably the most typical activity of newer
botnets, turning them into real cyber-weapons to take down unwanted targets3,4 or
to put up for rent to obtain profits5.
• Bitcoin mining, consisting in using the computational capabilities of the infected
machines to mine Bitcoins and obtain the reward.
• Click fraud, consisting in exploiting the infected machines to obtain revenue by means
of pay-per-click (PPC) advertisements.
3 https://krebsonsecurity.com/2016/09/krebsonsecurity-hit-with-record-ddos/
4 http://www.theregister.co.uk/2016/10/21/dyn_dns_ddos_explained/
5 https://www.bleepingcomputer.com/news/security/you-can-now-rent-a-mirai-botnet-of-400-000-bots/
2 Botnet Detection: Problems and challenges
The detection of bot infections poses challenges for defenders, since attackers design their
malware with the techniques used to detect bot infections in mind; so it is, again, the
never-ending game between attackers and defenders. The first challenge for defenders is given
by the fact that the architectures used by botnets are constantly evolving, both to increase
robustness and stealthiness against disruption and detection attempts, moving
from traditional IRC-based and centralized solutions, to stealthier HTTP-based but
still centralized solutions, and even to P2P-based solutions offering the most advantages
both from the point of view of robustness and of stealthiness.
Another challenge is posed by the increasing stealthiness of malicious activities, which evolve
toward more profitable and less observable models. A typical example is the disappearance
of e-mail spam activity, easier to detect because of the anomalous DNS MX queries and
SMTP traffic it generates, in favor of the rise of stealthier click fraud solutions, which
employ stealthier HTTP communication and are very profitable.
Finally, since the number of Internet-connected devices is monotonically increasing and
estimates state that by 2020 about 20 billion Internet of Things (IoT) devices, known
to be particularly vulnerable to attacks and malware infections, will be connected to the
Internet6, the design of scalable and performance-efficient solutions will be one of the major
challenges for researchers in the field of botnet detection in the coming years.
3 Botnet detection approaches taxonomy
To counter the rising botnet phenomenon, a wide set of possible detection strategies
has been designed, implemented and evaluated by the research community. Those detection
approaches can be categorized from many different perspectives, describing their
capabilities and specific methodologies for facing the problem of the detection (and possibly
characterization) of bot-infected hosts at different network scales (enterprise, ISP, etc.). A
possible taxonomy of botnet detection methodologies is shown in Figure 1, where botnet
detection methods are characterized in terms of:
• Which botnet type they are designed to detect (3.1);
• What aspect of the botnet they are designed to track (3.2);
• What is the source from which the features are extracted (3.3);
• How the features used for the detection are extracted (3.4);
• How the collected features are correlated to derive conclusions (3.5);
• What is the algorithm used to detect the presence of botnets (3.6).
6 http://www.gartner.com/newsroom/id/3598917
Figure 1: Botnet detection approaches taxonomy.
In the following sections, those characterizations are first briefly introduced and then
detailed in terms of the state-of-the-art literature solutions presented in this work, showing
how the problem of botnet detection is addressed by currently available solutions.
3.1 Botnet type
As explained in Section 1, botnets may differ on the basis of their structure, communication
technique and the malicious activities they perform once they have infected a target and are
instructed by the botmaster to do so. This means that in the wild we may encounter a
wide variety of different kinds of botnets we must be able to detect and identify. Also,
different characteristics mean both that different challenges have to be overcome and
that different kinds of features can be leveraged to reach the detection goal. A possible
taxonomy, showing the different types of botnets the currently available detection
methods try to detect, is shown in Figure 2.
Figure 2: Botnet detection approaches taxonomy: Botnet type.
In brief, from the botnet type perspective, we have the following characterization:
• Centralized, focused on detecting botnets with a centralized structure, which is the
most traditional case; those can be further classified into:
– IRC-based botnets: the most typical case, where the commands are sent
(pushed) through IRC channels and servers to control the zombie hosts;
– HTTP-based botnets: another classical scheme, in which the commands are
retrieved (pulled) by the bots from the C&C using the HTTP protocol;
– Domain-Flux botnets: these employ Domain-Fluxing techniques in their
centralized structure to increase the botnet's resilience against disruption
attempts; the bots may need to be equipped with Domain Generation Algorithms
(DGAs) to be able to get in touch with the C&C server by dynamically generating
the current domain behind which the latter is hosted.
• Peer-to-peer (P2P), so the detection process is focused on detecting P2P and
distributed botnets, more robust than the centralized ones.
In addition, botnet detection techniques may also decide not to leverage particular
structural/protocol characteristics of botnets: in this case, we say that the detection mechanism
is not tailored to a particular botnet type, but is rather generic.
3.1.1 Centralized botnets
Most of the initial effort put by researchers into designing effective and useful botnet
detection mechanisms focused on detecting centralized botnets. This is due to the fact that
the very first examples of botnets were indeed based on a centralized structure, in which
bots were controlled by the botmaster through malicious C&C servers, often hard-coded
directly in the bot logic. The predominance of this paradigm directly reflected on the
literature of those first years, dominated by centralized-botnet detection methods.
IRC based: This is the most typical case: due to its simplicity and low bandwidth
requirements, the IRC protocol has for a long time been the favorite mechanism of botmasters
for designing botnets. For this reason, many methods have been proposed to detect this kind
of botnet [33] [25] [39] [9] [17] [11] [44] [14].
Goebel et al. [9] presented Rishi, an IRC botnet detection method employing n-gram analysis
scoring to identify bot-infected hosts, raising an alarm and reporting infected machines
to network administrators through automatically generated warning e-mails. Rishi is
capable of detecting bot-infected machines by exploiting the fact that those must contact
their C&C server directly after infection by means of specific IRC channels that have
particular characteristics distinguishing them from non-malicious IRC channels; those
are particular naming conventions (e.g. country codes to understand the bot location,
long unique numbers to avoid duplicate names, OS substring identifiers, etc.) observed in
bots, matching against regular expressions of known botnets, and whitelist/blacklist matching,
with the lists maintained both statically (with manual updates by the maintainer) and
dynamically by Rishi itself, exploiting the same threshold scoring mechanism used for the
detection of bots. This mechanism enables Rishi, which is just a small Python script with
less than 2k lines of code, to achieve very good detection results.
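The following sketch illustrates the kind of threshold scoring over IRC nicknames that Rishi performs; the regular expressions, weights, whitelist and threshold are invented for this example and do not reproduce Rishi's actual rule set.

# Illustrative scoring function in the spirit of Rishi's analysis; the
# patterns, weights and threshold are made up for this sketch.

import re

SUSPICIOUS_PATTERNS = [
    (re.compile(r"\b(DEU|USA|ITA|FRA|GBR)\b"), 2),   # country-code tags
    (re.compile(r"\d{6,}"), 3),                      # long "unique" numeric suffixes
    (re.compile(r"(XP|W7|WIN|2K)", re.I), 2),        # OS identifier substrings
    (re.compile(r"[\[\]\|]{2,}"), 1),                # heavy bracket/pipe decoration
]
WHITELIST = {"friendly_user", "opersbot"}
THRESHOLD = 5

def score_nickname(nick: str) -> int:
    if nick.lower() in WHITELIST:
        return 0
    return sum(weight for pattern, weight in SUSPICIOUS_PATTERNS if pattern.search(nick))

def is_suspicious(nick: str) -> bool:
    return score_nickname(nick) >= THRESHOLD

if __name__ == "__main__":
    for nick in ["[DEU|XP|8471234]", "alice", "USA|2K|0098123"]:
        print(nick, score_nickname(nick), is_suspicious(nick))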
A problem with Rishi is that it requires signatures and packet inspection, which is
not always possible, both because of timing constraints and because of the use of encryption
techniques by bots. Karasaridis et al. [17] designed a wide-scale botnet detection and
characterization method capable of characterizing and identifying bot-infected machines at
Tier-1 ISP network level by means of non-intrusive, header-based and scalable detection
algorithms. Their solution first identifies suspicious hosts involved in Candidate Controller
Communications (CCCs), exploiting IRC protocol traffic flow characteristics and communications
involving remote hosts that appear to be hub servers (i.e. hosts having multiple connections
with many suspected hosts). Then those CCCs are analyzed and aggregated by means of their
traffic characteristics, and their maliciousness is finally validated in terms of additional
sources of information (e.g. honeypot-based detection, domain name validation, etc.) or
by means of the activities performed by the suspect hosts once they are observed (e.g.
scanning, spamming, etc.), which are additionally used to characterize the particular bot
infection and the involved hosts by means of a similarity measure over the observed activities.
HTTP based: Centralized botnets may also employ the HTTP protocol, to blend more
smoothly with the benign background traffic of the infected hosts. In the literature there
are not many works specifically tailored to HTTP-based detection [7] [19]; rather,
there are many methods whose authors claim them to be effective against centralized botnets,
independently of whether the communication protocol used is IRC or HTTP [17] [39] [11] [44] [4] [42].
Yen et al. [44] designed TĀMD (Traffic Aggregation for Malware Detection), a system
aiming at identifying even stealthy malware communication by exploiting the fact that
the infection is rarely constrained to a single host within a big enterprise network, and so
aggregating traffic with similar characteristics to infer the presence of an infection
in the network, which can also be a novel and never-observed one. The goal is to identify
aggregates of traffic involving multiple hosts sharing similar characteristics in terms
of common destinations, similar payload (i.e. string edit distance matching with moves)
and common host platform, justifying infections even in a platform-specific manner in
terms of the defined aggregates, leveraging no background infection information but only
past knowledge about the normal traffic in the network, and employing efficient algorithms
from the areas of signal processing and data mining, achieving a very good detection rate even
when only a small percentage of hosts was indeed infected (0.0097%).
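The aggregation-by-common-destination intuition can be illustrated with a much simpler stand-in than TĀMD's actual machinery: the sketch below groups internal hosts whose sets of external contacts overlap strongly, using a plain Jaccard similarity with an arbitrary threshold instead of the signal processing and data mining algorithms of the original work.

# Much-simplified stand-in for the "common destination" aggregation idea:
# group internal hosts whose sets of external contacts overlap strongly.
# The Jaccard threshold is purely illustrative.

from collections import defaultdict
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def aggregate_by_destination(flows, threshold=0.6):
    """flows: iterable of (internal_host, external_destination) pairs."""
    contacts = defaultdict(set)
    for host, dst in flows:
        contacts[host].add(dst)
    groups = []
    for h1, h2 in combinations(contacts, 2):
        if jaccard(contacts[h1], contacts[h2]) >= threshold:
            groups.append((h1, h2))
    return groups  # host pairs worth inspecting as a possible infection aggregate

if __name__ == "__main__":
    flows = [("10.0.0.1", "203.0.113.5"), ("10.0.0.1", "203.0.113.9"),
             ("10.0.0.2", "203.0.113.5"), ("10.0.0.2", "203.0.113.9"),
             ("10.0.0.3", "198.51.100.77")]
    print(aggregate_by_destination(flows))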
Domain-Flux botnets: To improve their resilience against takedown attempts, centralized
botnets moved to more complex models employing Fast-Fluxing techniques, in which
multiple hosts, each associated with a different IP address, register and de-register
a domain name very quickly and repeatedly (employing Round-Robin-like schemes).
This, combined with Domain Generation Algorithms (DGAs), makes it possible to change
the domain name of the C&C servers dynamically without changing the botnet source code and
dramatically increases the takedown effort. This however comes at the price of the
botnet's stealthiness: DGAs produce anomalous DNS traffic, enabling the design of techniques
capable of detecting Fast-Flux-based botnets [43] [4].
Yadav et al. [43] designed a botnet detection technique that exploits the features of
DGAs to detect Fast-Fluxing botnets. The core ideas are that (i) the domain names created
by DGAs have a high level of entropy (required to make them hard to predict), unlike
human-generated ones, and (ii) the bots generate many subsequent failed DNS queries close
to a successful one. Those characteristics are exploited in the detection phase by
implementing an on-line detection mechanism working in bins (i.e. time windows) and
performing time and entropy correlation on the DNS queries happening in specific bins. In
addition, to improve efficiency, a staged filtering method employing IP whitelisting, IP
degree (i.e. the number of domains resolving to that IP), temporal correlation across bins
with a threshold on the probability of DNS failures (to preserve only hosts with sufficient
failure probability) and succeeding/failing domain set entropies (using edit distance for
entropy calculations) ensures that only a small portion of traffic goes through all the steps,
consequently reducing the computational effort and enabling the system to process Tier-1
ISP generated traffic with reasonable latency, affording real-time detection with high
detection accuracy.
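The entropy intuition behind this method can be illustrated in a few lines: the sketch below computes the character-distribution entropy of a domain label and flags long, high-entropy labels as possibly algorithmically generated. The threshold and minimum length are arbitrary, and the temporal/failure-correlation stages of the full method are omitted.

# Sketch of the entropy intuition only: algorithmically generated labels tend
# to have a flatter character distribution than human-chosen names. The
# threshold is illustrative; the binning and failure-correlation steps of the
# full method are not reproduced here.

import math
from collections import Counter

def char_entropy(label: str) -> float:
    counts = Counter(label.lower())
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_generated(domain: str, threshold: float = 3.5) -> bool:
    label = domain.split(".")[0]          # first label, e.g. "x3kq9zp2fwzjd0a"
    return len(label) >= 10 and char_entropy(label) >= threshold

if __name__ == "__main__":
    for d in ["google.com", "x3kq9zp2fwzjd0a.com", "wikipedia.org"]:
        print(d, round(char_entropy(d.split(".")[0]), 2), looks_generated(d))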
3.1.2 Peer-to-peer (P2P) botnets
As already stated in Section 1.1, the P2P model started to be favored over the centralized
one due to its increased robustness. In addition, along with the rise of P2P file-sharing
applications, the capability of botnet traffic to pass unnoticed increased, making
it harder for defenders to identify it within the possibly huge background traffic. For
this reason, researchers moved their focus to detection schemes capable of detecting this new
kind of botnet, even in the presence of benign background P2P traffic. There are many botnet
detection methods in the literature specifically designed to detect P2P botnets and they
do not lack variety, with methods based on data mining [22], big data analytics [36],
machine learning [34], graph-based techniques [27] [5] [8] and more [31] [23] [46] [45] [3].
Chang et al. [3] designed a P2P botnet detection method employing behavior clustering
and statistical tests. The intuition is that by monitoring the network prior to infection, it
is possible to cluster active nodes into behavior clusters; this is done by selecting
attributes that favor the identification of normal behaviors within the network, such as the
popularity of the destinations contacted by the nodes, since normal users tend to contact
a small range of destinations with different popularities while bots instead tend to
contact a high number of destinations independently of their popularity. The nodes
are clustered into behavior clusters by means of a classic agglomerative clustering
algorithm that initially considers each single node as a cluster, which is expanded at each
round considering the pairwise distance among the current behavior clusters, measured by the
extended Jaccard distance and merging only if it is below a certain threshold, until
termination when stability is reached. Then, the identified behavior clusters are used to
determine whether a new behavior (i.e. a C&C behavior) is suddenly introduced in the network:
this is done by performing two kinds of statistical tests that consider the l most popular
behaviors and whether the new data altered either (i) the proportion of nodes belonging to those
clusters or (ii) the intra-cluster distance among those clusters. If a sufficient number of
those clusters gets altered, an alarm is raised and the C&C behaviors are identified by
using the distance between the updated behaviors in the new observations and the closest
behavior in the old ones, clustering nodes between C&C and normal behavior and yielding
identification of all bots with a low false positive rate (0.05%).
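A minimal sketch of the clustering step is shown below: it implements the extended Jaccard distance and a threshold-based agglomerative merge loop over toy behavior vectors. The single-linkage choice, the feature encoding and the 0.3 threshold are illustrative assumptions rather than the exact settings of the original work.

# Sketch of threshold-based agglomerative clustering with the extended Jaccard
# distance, in the spirit of the behavior-clustering step described above.

import numpy as np

def ext_jaccard_dist(x: np.ndarray, y: np.ndarray) -> float:
    dot = float(np.dot(x, y))
    denom = float(np.dot(x, x) + np.dot(y, y) - dot)
    return 1.0 - (dot / denom if denom else 0.0)

def cluster_behaviors(vectors, threshold=0.3):
    clusters = [[i] for i in range(len(vectors))]   # start: one node per cluster
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage: minimum pairwise distance between the two clusters
                d = min(ext_jaccard_dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if d <= threshold:
                    clusters[a] += clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

if __name__ == "__main__":
    # toy node behavior vectors, e.g. contact counts toward destination-popularity buckets
    vecs = [np.array(v, dtype=float) for v in
            [[9, 1, 0], [8, 2, 0], [1, 1, 8], [0, 2, 9], [5, 5, 5]]]
    print(cluster_behaviors(vecs))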
A problem in P2P botnet detection is that if the malicious activities performed by bots are
hardly observable, detection of those stealthy botnets can be difficult in the presence of
benign traffic originating from legitimate P2P applications. This problem has been faced
by Zhang et al. [45] [46], who designed a detection method capable of identifying stealthy P2P
botnets by means of statistical traffic fingerprints. The detection method in [46] involves
5 steps. First, (i) the traffic is reduced by filtering connections involving DNS-resolved IP
addresses, since the vast majority of traffic generated by P2P applications is toward IP
destinations hard-coded or obtained by looking up IPs from routing tables of the overlay
network. Then P2P clients are identified: (ii) coarse-grained detection of P2P clients,
exploiting the fact that P2P applications perform many failed outgoing connection attempts
to remote peers, to identify candidate P2P clients; (iii) fine-grained detection of
P2P clients by means of a flow clustering process that groups the similar flows of each
candidate P2P client (in terms of traffic characteristics like the number of sent/received
packets and bytes) into fingerprint clusters, enabling the characterization of P2P clients
in terms of their set of fingerprint clusters and dropping all candidates with no fingerprint
clusters or with destinations covering too few BGP prefixes (which is unlikely for a P2P
application). Finally, P2P bots are identified among P2P clients: (iv) coarse-grained
detection of P2P bots, exploiting the fact that, to keep their overlay network active, P2P
bots have to stay alive for a time comparable to that of the infected system, and so retaining
only the candidate P2P bots having their estimated active time (i.e. the time between the
last and the first observed packet) comparable to the active time of the P2P application,
estimated by the statistical fingerprint having the maximum active time among the ones of
the node; (v) fine-grained detection of P2P bots, which exploits the fact that bots in the same
botnet use the same protocol (i.e. the average number of bytes sent/received is similar) and
that the set of peers contacted by the bots of the same botnet largely overlaps (differently
from legitimate P2P applications), defines two distance functions upon those observations
that are then used to group candidate P2P bots into clusters, and preserves only the
clusters that are sufficiently dense, because the distance between two hosts is decided by the
minimum distance of fingerprint clusters (i.e. the set of contacted IPs largely overlaps),
enabling the system to signal those dense clusters as constituted by P2P bots of a specific
family. In addition to the already very promising results of [46], in [45] the performance of
the detection algorithm is improved by eliminating the time-expensive step (ii) without
sacrificing detection capabilities and parallelizing the other time-expensive step (v),
employing a two-step clustering process applied to each single host that enables work
parallelization by considering multiple computational nodes that first partition the hosts
into sets to produce multiple two-step clusters, in the end aggregated by a cumulative
function. This gives [45] a 60% reduction in storage costs, due to bypassing (ii), and an
overall 68% processing time reduction, making the solution highly scalable and performant.
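As a small illustration of the coarse-grained filtering steps (i)-(ii), the sketch below drops flows toward DNS-resolved destinations and keeps hosts that accumulate many failed outgoing connection attempts as candidate P2P clients. The flow record format, field names and failure threshold are assumptions made for this example.

# Sketch of the coarse-grained P2P-client signal only: hosts that generate many
# failed outgoing connection attempts to non-DNS-resolved peers are kept as
# candidates. Flow records, field names and the threshold are assumptions.

from collections import Counter

def candidate_p2p_clients(flows, resolved_ips, min_failures=50):
    """flows: iterable of dicts like
       {"src": "10.0.0.5", "dst": "203.0.113.9", "failed": True}."""
    failures = Counter()
    for f in flows:
        if f["dst"] in resolved_ips:
            continue                      # step (i): drop DNS-resolved destinations
        if f["failed"]:
            failures[f["src"]] += 1       # step (ii): count failed outgoing attempts
    return {host for host, n in failures.items() if n >= min_failures}

if __name__ == "__main__":
    resolved = {"203.0.113.9"}
    flows = ([{"src": "10.0.0.5", "dst": f"198.51.100.{i}", "failed": True} for i in range(80)]
             + [{"src": "10.0.0.7", "dst": "203.0.113.9", "failed": True}] * 10)
    print(candidate_p2p_clients(flows, resolved))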
3.1.3 Generic botnets
Along with the methods that exploits the stuctural characteristics of the botnets, we
can find in the literature many detection methodologies that do not take advantage of
these properties in their detection algorithms. The advantage of this generalization is the
capability to detect a potential higher number of different botnets type, at the price of
having less features to exploit in the detection decision. In particular we can find detection
methods leveraging signature-based models [41] [21], machine learning [40] [48] [16], data
mining [26] [20] and more [12] [13] [24] [18]. Many of the proposals shown in this work
detects generic botnets (i.e. all those mentioning no specific botnet type); examples are
BotHunter [12] and BotMiner, by Gu et al. and presented in the later sections (3.3.2).
3.2 Detection target
Another way in which botnet detection methods can be classified is in terms of what signs
of infection they leverage to detect the presence of botnets within the monitored network.
Very simply, we have essentially two possibilities: C&C detection and bot activity detection
(Figure 3). Briefly, the characteristics of the two are:
Figure 3: Botnet detection approaches taxonomy: Detection target.
• C&C detection, exploiting the fact that all botnets require a command and
control channel for sending commands and controlling the bots, a channel that can also be
used to report and exfiltrate information from the infected machines;
• Bot activity detection, exploiting the fact that all botnets, to be a valuable asset
for the botmaster, have to perform some kind of malicious activity triggered by the
reception of a particular command sent by the botmaster itself.
In the following, the characteristics of the two are examined in more depth and proposals
from the literature are presented to give a better understanding of their different capabilities.
3.2.1 C&C detection
C&C detection leverages the fact that every botnet must have a C&C channel to obtain
the commands given by the botmaster. There are many different possible channels and
communication techniques employed by botnets, as already described in Section 1.2. This
implies that there are many different techniques available to track the different kinds of botnet
channels employed in practice. The advantage of this approach is that by focusing on the
detection of the C&C channel it is possible to achieve early detection and identify infections
prior to the execution of malicious operations. Clearly, if the communication channel
used is particularly stealthy (e.g. social networks) it could be difficult to detect the botnet
by looking only at the C&C communications. A big share of the work available in the
literature focuses on C&C detection, which is performed by employing different techniques:
supervised machine learning classifiers [25] [34] [48] [16] [19], clustering [3] [46] [45] [40],
data mining [22] [20], graph-based approaches (mostly for P2P botnets) [5] [27] [8] and
many more [39] [9] [17] [44] [14] [31] [29] [42] [21] [7] [36].
Strayer et al. [39] proposed a system that looks for evidence of command and control activity
by examining flow characteristics and identifies evidence of botnet activity in the monitored
network. It exploits the fact that some botnets exhibit tight command and control
interactions between the involved actors, showing timeliness in the reception of and answering
to the command-response interactions. This is used to detect botnets employing IRC
channels by (i) filtering meaningless traffic through non-chat-like traffic filters and white/black
listing, (ii) flow-based classification using machine learning classifiers to reduce the traffic
by preserving only candidate IRC botnet flows, (iii) temporal and size-based flow
correlation to group similar flows into clusters of related flows on the basis of
a Euclidean-like distance and (iv) topological analysis to identify topological structures
in the automatically generated graphs of the clustered flows (e.g. common endpoints,
rendezvous points, etc.) typical of IRC-based botnets, in order to suggest further investigation
of the hosts belonging to the suspicious cluster.
3.2.2 Bot activity detection
Since all botnets, to be profitable from the point of view of the botmaster, sooner
or later will have to perform some malicious task, this fact can be used by the detection
engine to identify bot infections; this is exactly the case of bot activity detection
techniques. The advantage of this kind of technique is that it is more likely to identify
infections, since the malicious activities typically performed by the bots (except for data
exfiltration, which is a very stealthy activity) are easier to observe than C&C
communication, which can be very stealthy. The drawback is clear: we have to wait for
the bot to perform some malicious activity, sacrificing the early detection advantages of C&C
detection. This is why some works proposed hybrid solutions combining C&C detection
with bot activity detection to improve performance. A good number of solutions proposed
in the literature are built on bot activity detection; there are botnet life-cycle based
solutions [12] [18], group behavior analysis solutions [11] [13], signature based solutions
[33] [24] [41], DNS based solutions [4] [43] [35] and more [26] [23].
Many proposals described in this document focus on detecting the activities performed by
the bots rather than their C&C channels. This is the case of the work by Yadav et al. [43],
described in Section 3.1.1, which exploits the anomalous DNS activity pattern generated by
Domain Generation Algorithms to detect botnets employing Fast-Fluxing techniques;
of BotHunter by Gu et al. [12], described in later sections (3.3.2), which leverages
the dialog warnings generated by an IDS deployed in the monitored network to detect
botnets through the observed malicious activities; and in particular of BotTracer
by Liu et al. [24], described in Section 3.3.1, which generates signature-based models by
dynamically executing the malware in a virtualized environment and detects bot-infected
hosts by observing the activities performed in the virtualized environment and
matching them in real time with the ones performed by the host in the real world.
3.3 Feature source
Detection methods can then be characterized in terms of the source of information from which
the features used by the detection engine are extracted. The different sources of information
employed in the field of botnet detection, shown in Figure 4, are divided into two categories:
host-based and network-based. Their differences are the following:
• Host-based information, obtained by monitoring and analyzing the internals of
a computer system for signs of infection, instead of looking at the network traffic
on the external interfaces;
• Network-based information, obtained by monitoring and analyzing the network
traffic to detect signals of botnet activities. It can be further divided into:
– Payload inspection, where the entire packet content is used to derive features;
– Network flows, where the headers of packets are used to reconstruct the network
flows involving the monitored hosts and to derive features based on statistics
computed on the characteristics of these flows;
– DNS traffic, where anomalies in the kind, timing and amount of DNS traffic
observed in the network are used to derive features;
– Network connection graphs, where graphs are constructed on the basis of the
observed connections involving the monitored hosts, with the goal of identifying
structural patterns signaling bot infections;
– Botnet life-cycle pattern, where the typical life-cycle model, from infection
to the exhibition of malicious activity, observed in botnet infections is used as an
anomaly pattern to identify infected machines by looking at the sequence of
states or activities observable in the network traffic they produce;
– Group activities, where monitored machines in the network are checked to
identify hosts performing operations similar in kind and timing, deriving groups
of hosts and signifying the presence of multiple bot-infected hosts in the same
network that are likely to belong to the same botnet.
Figure 4: Botnet detection approaches taxonomy: Feature source.
3.3.1 Host information
In this case, information from the underlying machine is leveraged to detect the presence
of a bot infection. This is not very common in practice: since botnets, by design,
are forced to produce network traffic, it is much more typical to leverage network-based
information to tune the detection algorithms. However, there are in the literature some
examples of detection mechanisms leveraging host-based information, gathered from the
internals of the computer system, to detect bot infections [33] [42] [24] [26].
Probably one of the most representative examples is BotTracer, proposed by
Liu et al. [24]. BotTracer works by exploiting the following observations: (i) bots are
designed to automatically start up at system boot (typically) by modifying the automatic
startup process list or registry entries, differently from other whitelisted applications within
the system (i.e. the ones the user gives explicit consent for this to happen);
(ii) all bots need to establish a C&C channel with the botmaster, typically a few moments
after their automatic start-up; (iii) sooner or later, all bots will have to perform some
malicious operation in the system, typically consisting of data exfiltration or participation
in coordinated attacks (e.g. DDoS attacks). BotTracer is capable of detecting bot infections
by cloning the underlying system into a virtual machine; this virtual machine serves as
a real-time testbed: it runs in parallel with the real physical system, but without the noise
generated by the user's operations, to better detect the signs of bot infection (i-iii) whenever
these are happening also in the real one.
However, this kind of host-based system is not meant to substitute network-based
ones, but rather to complement them, enabling better detection capabilities.
3.3.2 Network information
The majority of the proposals found in the literature leverage network-based information
for the purpose of botnet detection. The reason is clear: all botnets must generate
some network traffic, at least to get in touch with their botmaster and receive further
instructions. Since the traffic generated by botnets can be characterized along many dimensions,
methods using network information to derive their features are further decomposed into many
categories, described in detail in the following.
Payload inspection: The entire content of packets is used to extract features for the
detection phase. This ideally enables more information to be obtained from the observed traffic,
at the price of more processing overhead. The main problem, which led to a drastic drop in
the popularity of this method, is the use of encryption and obfuscation techniques
by newer botnets, preventing meaningful knowledge from being derived from the content of packets.
A significant number of works employing payload-based information have been proposed
in the literature, using payload information for signature generation [33] [9] [14] [41]
[42] [21], traffic and behavior aggregation [11] [13] [44] and more [12] [18].
Lee et al. [21] proposed an approach for automatically generating payload-based models
from botnet and benign traffic traces, to be then used to match packets against this set of
automatically generated signatures and detect infections. The approach has two main phases:
(i) in the learning phase, packets of the input traces are first grouped according to their
size into clusters representing payload length ranges (containing both benign and botnet
packets); then token signatures, consisting of byte sequences at specific offsets, are identified
within the packets of the clusters, dividing each range cluster into benign and botnet
signatures to form the payload-based models. Then, to improve efficiency and reduce
the memory overhead, a model reduction step is carried out to remove from the botnet model
irrelevant signature values with low discriminative power or frequently occurring in the
benign model (leading to false positives), turning the former into a probabilistic
model that considers the probability of a certain token appearing at a specific offset across
all the tokens appearing in that position. Finally, (ii) the model is leveraged to match
the features extracted from observed packets and to trigger an alarm when a threshold of
matched signatures is reached. The approach achieves high detection rates (94%) and a low
false positive rate (0.9%) but, like all payload-based models, may suffer if encryption
is used in the proper way (i.e. salting the original payload to create irrelevant signatures).
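A much simplified sketch of the (offset, token) signature idea is given below: it learns short byte sequences at fixed offsets that are frequent in botnet payloads and rare in benign ones, and then matches new payloads against them with a hit threshold. Token length, frequency cut-offs and the threshold are illustrative and do not correspond to the parameters of the original approach.

# Much-simplified sketch of the (offset, token) signature idea; all numeric
# parameters are illustrative, not those of the original approach.

from collections import Counter

TOKEN_LEN = 4

def learn_tokens(botnet_payloads, benign_payloads, min_bot=0.5, max_benign=0.05):
    def freq(payloads):
        c = Counter()
        for p in payloads:
            for off in range(0, len(p) - TOKEN_LEN + 1):
                c[(off, bytes(p[off:off + TOKEN_LEN]))] += 1
        return {k: v / max(len(payloads), 1) for k, v in c.items()}
    bot, ben = freq(botnet_payloads), freq(benign_payloads)
    # keep tokens frequent in botnet traffic and rare in benign traffic
    return {t for t, f in bot.items() if f >= min_bot and ben.get(t, 0.0) <= max_benign}

def matches(payload, tokens, threshold=2):
    hits = sum(1 for off, tok in tokens if payload[off:off + TOKEN_LEN] == tok)
    return hits >= threshold

if __name__ == "__main__":
    bot = [b"JOIN #cc " + bytes([i]) * 8 for i in range(10)]
    ben = [b"GET /idx" + bytes([i]) * 8 for i in range(10)]
    sigs = learn_tokens(bot, ben)
    print(matches(b"JOIN #cc XXXXXXXX", sigs), matches(b"GET /idx YYYY", sigs))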
Network flows: Information originating only from the packet headers is extracted and
used to reconstruct network flows out of the network traces, for computing statistics
and characterizing the observed network traffic. Using only header information means
that some information will be dropped: the advantage is that botnet countermeasures
such as packet payload encryption and obfuscation become ineffective. Once the network
flow statistics are computed, there are many possibilities for deriving conclusions, such as
data mining [26] [22] [20], machine learning [25] [34] [48] [36] [16], behavior clustering and
traffic aggregation [39] [3] [23] [46] [45] [40] [13] [11] and many more [17] [31] [5] [27] [8] [12].
Tegeler et al. [40] designed BotFinder, a system that detects bot infections leveraging
only high-level properties of the bot's network traffic and without performing deep packet
inspection. The system works in two main phases: a training phase and a detection phase.
In the training phase, bot samples of different families are run in a testbed environment
to extract botnet traces; then, if flow information is not available, flows are reassembled
from the captured packet data (to obtain NetFlow data) and traces are extracted by
chronologically ordering and grouping all the flows between two endpoints. From the
trace obtained this way, it is possible to extract the five relevant features used by BotFinder for
the botnet model creation: (i) the average time interval between two flows in the trace (since
bots typically exhibit regularity); (ii) the average duration of the connections in the trace
(since bot connections typically have a small duration); the averages of (iii) source bytes and
(iv) destination bytes within the trace (since bots perform many similar connections); (v)
the Fast Fourier Transform (FFT) over a binary sampling of the trace, to detect underlying
communication regularities in the C&C connection. The models are then created by
clustering over the five relevant features, processing the dataset once for each feature and
describing the model representing the botnet family by means of these five sets of clusters.
In the detection phase, observed traces are matched against the models, comparing each
statistical feature of the trace with the model's clusters: there is a "hit" each time a
feature of the trace belongs to one of the model's clusters, increasing the current score
proportionally to the quality of the cluster and of the trace's feature. Different scores
are kept for each model: detection is triggered whenever the highest score is above a global
pre-defined acceptance threshold. The method achieves good detection (above 90%) and
false positive rates (around 1%) and proved capable of detecting novel infections.
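The five per-trace features can be sketched as follows, starting from a chronologically ordered list of flows; the binary-sampling resolution used for the FFT and the reduction of the spectrum to a single dominant frequency bin are simplifications introduced for this example.

# Sketch of the five per-trace features described above, computed from a
# chronologically ordered list of flows; the FFT handling is simplified.

import numpy as np

def trace_features(flows, bin_seconds=60):
    starts = np.array([f["start"] for f in flows], dtype=float)
    gaps = np.diff(starts)
    # (v) FFT over a binary sampling of the trace to expose periodicity
    bins = np.zeros(int((starts[-1] - starts[0]) // bin_seconds) + 1)
    bins[((starts - starts[0]) // bin_seconds).astype(int)] = 1.0
    spectrum = np.abs(np.fft.rfft(bins))
    dominant = float(np.argmax(spectrum[1:]) + 1) if len(spectrum) > 1 else 0.0
    return {
        "avg_interval": float(gaps.mean()) if len(gaps) else 0.0,          # (i)
        "avg_duration": float(np.mean([f["duration"] for f in flows])),    # (ii)
        "avg_src_bytes": float(np.mean([f["src_bytes"] for f in flows])),  # (iii)
        "avg_dst_bytes": float(np.mean([f["dst_bytes"] for f in flows])),  # (iv)
        "dominant_freq_bin": dominant,                                     # (v)
    }

if __name__ == "__main__":
    # toy trace: one short connection roughly every 10 minutes
    flows = [{"start": i * 600.0, "duration": 2.0, "src_bytes": 300, "dst_bytes": 1200}
             for i in range(24)]
    print(trace_features(flows))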
DNS-based: In addition, it is possible to use only the information obtained from DNS
traffic. This is because, as already stated in Sections 1.2.4 and 3.1.1, botnets may produce
an anomalous amount of DNS traffic for different reasons (e.g. DGAs, covert channels,
etc.). This fact can be leveraged to derive very valuable information from DNS traffic and
to extract features used for detection purposes. For this reason, we can find
in the literature some proposals employing only DNS traffic to derive detection conclusions,
such as [43], described in Section 3.1.1, exploiting the anomalous DNS traffic generated by
DGAs, and other similar methods [4] [35].
Network connection graphs: These methods build a communication graph on the
basis of the connections involving the hosts in the observed network to identify anomalous
structures and patterns typical of botnet infections [39] [5] [27] [8]. One of the advantages
of these methods is the possibility to design, or use existing (e.g. PageRank7 [8]), powerful
graph and link analysis algorithms to identify bot infections with high confidence. The clear
problem is that this kind of analysis gives different results depending on the architecture
of the botnet to be detected; very good results are achieved for P2P botnets, while the
same is not true for centralized ones. However, the majority of new botnets are built
upon P2P overlay networks and many additional methods have proven effective
against centralized botnets (3.1.1); so these methods offer a very good alternative to
other anomaly-based ones and can also be combined with other techniques, such as cluster
analysis, to achieve even better performance [8]. An example of graph-based detection
mechanisms is the work by Coskun et al. [5], who proposed a method for detecting P2P
botnets by observing the fact that their members are very likely to communicate with at
least one common external bot during a given time window, i.e. they have a so-called
mutual contact. The work, described in 3.5.2, leverages this fact to construct the graph of
mutual contacts and to identify even dormant bots (i.e. those having performed no malicious
task yet) by running on this graph a "Dye-Pumping" algorithm designed by the authors,
achieving very good results.
7 https://en.wikipedia.org/wiki/PageRank
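The mutual-contact idea can be sketched by building, for one time window, a graph whose edges connect internal hosts sharing at least one (non-popular) external contact; the popularity-based whitelisting and the thresholds below are illustrative, and the Dye-Pumping propagation of the original work is not reproduced.

# Sketch of the mutual-contact graph construction only (the "Dye-Pumping"
# propagation of the original work is not reproduced). The popularity cutoff
# is an illustrative way to avoid trivially connecting every host through
# very popular services such as large CDNs.

from collections import defaultdict
from itertools import combinations

def mutual_contact_graph(flows, popularity_cutoff=0.8):
    """flows: iterable of (internal_host, external_contact) pairs seen in one time window."""
    contacts = defaultdict(set)
    for host, ext in flows:
        contacts[host].add(ext)
    n_hosts = len(contacts)
    # drop external contacts seen by more than popularity_cutoff of all hosts
    seen_by = defaultdict(set)
    for host, exts in contacts.items():
        for ext in exts:
            seen_by[ext].add(host)
    popular = {e for e, hosts in seen_by.items() if len(hosts) / n_hosts > popularity_cutoff}
    edges = defaultdict(int)
    for h1, h2 in combinations(contacts, 2):
        shared = (contacts[h1] & contacts[h2]) - popular
        if shared:
            edges[(h1, h2)] = len(shared)   # edge weight = number of mutual contacts
    return dict(edges)

if __name__ == "__main__":
    flows = [("10.0.0.1", "203.0.113.7"), ("10.0.0.2", "203.0.113.7"),
             ("10.0.0.1", "198.51.100.1"), ("10.0.0.2", "198.51.100.1"),
             ("10.0.0.3", "198.51.100.1"), ("10.0.0.3", "192.0.2.44")]
    print(mutual_contact_graph(flows))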
Botnet life-cycle pattern: The idea is to use the typical pattern botnet samples exhibit
upon infection as an anomaly signaling botnet infections. More specifically, the
majority of botnets perform the following pattern of actions upon infection of some host
[12]: (i) external-to-internal inbound scan, to identify the target host to infect; (ii) external-
to-internal inbound exploit, to take control of the machine for downloading the malware
sample; (iii) internal-to-external binary acquisition, to download and install the malware
sample and infect the target machine; (iv) internal-to-external C&C communication, to
communicate with the C&C server and receive the commands given by the botmaster; (v)
internal-to-external outbound infection scanning, to scan for external vulnerable hosts to
infect with the malware sample and enlarge the botnet. This technique typically leverages
dialog warnings generated by an IDS deployed within the perimeter of the monitored
network, collecting and correlating them for signals of the above activities (i-v): detection is
typically triggered when a sufficient number of dialogs is observed. These techniques, being
more general, are more flexible than model-based ones and typically enable the detection of a
higher number of infections. The only problem is that they typically have to go through
all (or most) of the above life-cycle steps (i-v), sacrificing early detection advantages.
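A toy dialog-correlation tracker in the spirit of this approach is sketched below: it accumulates per-host evidence of the life-cycle stages (i)-(v) from IDS-style events and declares an infection once enough distinct stages, including C&C traffic or binary download, are observed. The stage names, combination rule and threshold are illustrative only.

# Toy dialog-correlation tracker: accumulate per-host evidence of the stages
# (i)-(v) from IDS-style events; stage names, the "C&C or binary download plus
# one more stage" rule and the threshold are illustrative only.

from collections import defaultdict

STAGES = {"inbound_scan", "inbound_exploit", "binary_download", "cnc_traffic", "outbound_scan"}

class DialogCorrelator:
    def __init__(self, min_stages=2):
        self.evidence = defaultdict(set)     # host -> set of observed stages
        self.min_stages = min_stages

    def observe(self, host: str, stage: str):
        if stage in STAGES:
            self.evidence[host].add(stage)

    def infected_hosts(self):
        # require C&C traffic or binary download plus at least one other stage
        return [h for h, stages in self.evidence.items()
                if len(stages) >= self.min_stages
                and stages & {"cnc_traffic", "binary_download"}]

if __name__ == "__main__":
    corr = DialogCorrelator()
    for host, stage in [("10.0.0.9", "inbound_exploit"), ("10.0.0.9", "binary_download"),
                        ("10.0.0.9", "outbound_scan"), ("10.0.0.4", "inbound_scan")]:
        corr.observe(host, stage)
    print(corr.infected_hosts())   # -> ['10.0.0.9']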
Gu et al. [12] proposed BotHunter, which uses the dialog correlation strategy to detect
signs of bot-infected hosts in the monitored network, looking for the activities characterizing
the infection life cycle. BotHunter's correlator is built on top of the IDS Snort8, using a
customized malware-focused rule set kept constantly updated by the Snort community.
In addition, to complement this signature-based detection capability of Snort, BotHunter
extends it with two additional plug-ins: SLADE and SCADE.
SLADE (Statistical PayLoad Anomaly Detection Engine) is an anomaly-based engine
for payload exploit detection through lossy n-gram byte-distribution analysis, to identify
deviations from a normal traffic profile. SCADE (Statistical sCan Anomaly Detection
Engine) is a scan detection plug-in constituted by two modules: an inbound scan detection
module, which tracks scans toward internal monitored hosts and calculates an anomaly score
based on the number of received probes weighted by the specific port receiving the probe
(two possibilities: High Severity for known highly vulnerable and exploited services, and
Low Severity), and an outbound scan detection module based on a voting scheme (AND,
OR or MAJORITY) of three independent anomaly detection models considering (i) the
outbound scan rate, for detecting hosts performing high-rate scans across a large set of
external hosts, (ii) the outbound connection failure rate, for detecting abnormally high
failure rates weighted by High/Low severity port usage, and (iii) the normalized entropy
of the scan distribution, computing the distribution of outbound connection patterns to look
for the uniformly distributed target patterns typical of bot scans. Dialogs generated by the
signature-based and anomaly-based engines are classified according to the botnet life-cycle
model activities (i-v) in a network dialog correlation matrix managed by the dialog
correlation engine, which grows upon dialog generation and shrinks upon dialog expiration,
working on the basis of soft and hard time intervals upon which dialogs are pruned
and aggregated respectively. When a dialog sequence resulting from the dialog correlation
procedure is found to cross the threshold for bot declaration, a bot infection is signaled
and a bot profile is produced, characterizing the infection on the basis of the received
dialogs, its duration and its participants. For a long time, due to its capabilities and
constant development, BotHunter has been considered the de-facto state of the art in
botnet detection and, being open source (a stand-alone version can be freely downloaded
here9), it has been used to compare the performance of newer botnet detection techniques.
8 https://www.snort.org/
9 https://www.metaflows.com/wiki/Stand-Alone_BotHunter
Group activities: These methods leverage the presence of anomalous temporal coordination
in the kind of activities performed by the hosts of the monitored network. Botnets
are characterized by the presence of entire clusters of machines performing the same kind
of operation in coordination: this is due to the fact that multiple hosts are infected by
the same malware and share the same commands submitted by the botmaster controlling
the entire botnet. This means that by monitoring the network, in case of multiple
bot infections, we can observe groups of hosts performing similar operations in the same
time slots, hence sharing the same traffic behavior. The main advantage of this method
is that, by correlating the information from multiple hosts, there is stronger evidence
of infections, which reflects positively on the false positive and detection rates. Also, group
analysis based techniques are much more difficult to evade for botnet developers, since
coordination among the bots characterizes all bot infections: if the botmaster gave specific
commands to each single bot so as not to show group behavior, then he would lose the
advantage of having many bots performing the same task, which is the main advantage of
botnets, and would also make it more difficult to control the botnet itself. He could structure
the botnet into different subsets of hosts performing different operations, but this would
increase the botnet's complexity and would not completely evade group behavior analysis
techniques, since he should also check that no two bots of the same subset end up in the same
network (which is very complex and unattractive for botmasters). However, this method has
a clear drawback: it requires multiple hosts to be infected in the same network and by the same
malware to achieve good performance and a reasonable number of false positives;
but, even if botnets typically show lateral spreading characteristics, this is not always the
case. For this reason, it is not uncommon to employ additional mechanisms to cope with
the possibility of unique (or few) infections. The kinds of group activities that are typically
observable and used by botnet detection methods are found in the DNS traffic [4] [35] or
in the C&C communications and botnet activities [11] [13].
Gu et al. proposed BotSniffer [11] and the later BotMiner [13], which is probably the most representative work in the category of group-behavior-analysis detection methods. They work on the principle that botnets, differently from normal activities, demonstrate a synchronized and correlated behavior, both in receiving commands through the C&C channel and in answering on that same channel and/or performing some activity. Their early work, BotSniffer [11], focused on detecting push-based (e.g. IRC-based) and pull-based (e.g. HTTP-based) botnets by devising anomaly-based detection algorithms able to spot bots belonging to these models in a port-independent way and with no prior knowledge. BotSniffer works by detecting the spatial-temporal correlations and similarities characterizing the message and activity responses of the bots belonging to the same botnet. After a traffic-filtering stage and a C&C-like protocol matcher that drop unmeaningful traffic, it identifies message responses by monitoring specific C&C protocol messages (e.g. IRC PRIVMSG) and uses two anomaly detection modules, namely Abnormally High Scan Rate and Weighted Failed Connection Rate for scan detection (plus an additional module for spam detection), to identify activity responses. Then, in the correlation stage, it employs two anomaly-detection algorithms for group activity/message response analysis based on a Threshold Random Walk (TRW), which makes it possible to set strict bounds on the false positive/negative rates. The Response-Crowd-Density-Check algorithm, for each time window, checks for the presence of a dense response crowd within each group of hosts connecting to the same server: if the fraction of clients within the group showing message/activity response behavior is larger than a threshold, the group forms a response crowd. Observing multiple response crowds within the same group gives high confidence that the group is (or is not) part of a botnet, and the corresponding hypothesis is accepted in a statistical test computed over many rounds of a TRW, until one of its thresholds is reached. The same round-based decision is performed by the second algorithm, Response-Crowd-Homogeneity-Check, which instead of looking at the density (i.e. the number of hosts) of the responses within the group, looks at the homogeneity of the message responses (measured through an n-gram-based distance) observed within each time window for the hosts of that group, to identify homogeneous crowds. In addition, specific checks may lead to detection even if only a single host is infected, by exploiting the broadcast characteristics of IRC channels and applying the same algorithms to the observed incoming messages rather than the outgoing ones for the IRC-based botnets, and the periodic visiting patterns for the HTTP-based ones.
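The correlation stage described above can be illustrated with a short, self-contained sketch of a Threshold Random Walk, i.e. a sequential probability ratio test. This is not BotSniffer's actual implementation: the per-round probabilities (theta1, theta0) and the error bounds (alpha, beta) below are illustrative assumptions, and each round's observation is simply 1 when a dense/homogeneous response crowd is seen in a time window and 0 otherwise.

```python
import math

def threshold_random_walk(observations, theta1=0.8, theta0=0.2,
                          alpha=0.005, beta=0.01):
    """Sequential probability ratio test over per-window crowd observations.

    observations: iterable of 0/1 values (1 = a dense/homogeneous response
    crowd was observed in that time window).
    theta1/theta0: assumed probability of observing a crowd under the
    "botnet" (H1) and "benign" (H0) hypotheses.
    alpha/beta: target false positive / false negative rates bounding the
    two decision thresholds.
    Returns "botnet", "benign" or "undecided".
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 when the walk exceeds this
    lower = math.log(beta / (1 - alpha))   # accept H0 when the walk drops below this
    llr = 0.0                              # cumulative log-likelihood ratio
    for y in observations:
        if y:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1 - theta1) / (1 - theta0))
        if llr >= upper:
            return "botnet"
        if llr <= lower:
            return "benign"
    return "undecided"

# Example: four consecutive windows with a response crowd cross the botnet threshold.
print(threshold_random_walk([1, 1, 1, 1]))
```

In BotSniffer the per-window observation comes from the density or homogeneity check, evaluated separately for each group of hosts contacting the same server.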
BotMiner [13], instead, makes no assumption on the C&C channel or architecture; the original idea of BotSniffer is improved by employing machine-learning-based clustering: similar message responses and C&C traffic are clustered in a C-plane dedicated to C&C communication traffic, similar activity responses are clustered in an A-plane dedicated to malicious activities, and a cross-plane correlation then cross-checks the clusters of the two planes to find intersections and reinforce the evidence of bot infection. First, the C-plane and A-plane monitors, deployed at the edge of the network, run in parallel and capture, respectively, who is talking to whom, by tracking UDP and TCP flows, and who is doing what, by analyzing the outbound traffic for signs of malicious activities such as scanning (in the same way as BotSniffer), spamming, binary downloading and exploit activities, implemented as Snort plug-ins. Then, for C-plane clustering, after basic traffic and white-list filtering, vector representations of the C-flows are computed by extracting meaningful features such as the number of flows per hour and the number of packets per flow, and a two-step clustering process first groups similar flows on a reduced set of dimensions and then refines the results into smaller and more precise clusters by running a clustering process on each step-1 cluster considering all the dimensions. For A-plane clustering, instead, all the hosts performing at least one malicious activity are clustered first according to the type of activity; then, for each activity type, hosts are further clustered into smaller and more precise clusters according to the specific activity features. Finally, C-plane and A-plane clusters are cross-checked to find intersections and reinforce the evidence of bot infection. In particular, for each host having performed at least one malicious activity, a botnet score is computed by assigning an activity weight to each cluster and summing up the weights of the clusters the host belongs to; thus a host will have a higher score if it ends up in more A-clusters or if it ends up in a C-cluster with a large overlap with A-clusters. For those hosts, a similarity metric is computed by looking at the set of A-plane and C-plane clusters each host belongs to, taken as a bit-mask: this similarity is then used to apply hierarchical clustering and build a dendrogram encoding the relationships between hosts, identifying dense and well-separated clusters of (sub-)botnets with very good results (99.6% detection in the worst case).
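A minimal sketch of the cross-plane correlation step just described is given below. It is not BotMiner's implementation: the cluster assignments, the activity weights and the use of a plain Jaccard similarity over cluster bit-masks are simplifying assumptions, used only to show how cluster memberships from the two planes can be combined into a per-host score and a pairwise similarity.

```python
# Hypothetical cluster memberships: host -> set of cluster labels on each plane.
a_clusters = {"hostA": {"scan", "spam"}, "hostB": {"scan"}, "hostC": set()}
c_clusters = {"hostA": {"c1"}, "hostB": {"c1"}, "hostC": {"c7"}}

# Assumed activity weights (e.g. scanning is weaker evidence than spamming).
weights = {"scan": 0.5, "spam": 0.8, "c1": 1.0, "c7": 1.0}

def botnet_score(host):
    """Sum the weights of all clusters the host belongs to on both planes."""
    if not a_clusters[host]:
        # Only hosts with at least one malicious activity are scored.
        return 0.0
    clusters = a_clusters[host] | c_clusters[host]
    return sum(weights[c] for c in clusters)

def similarity(h1, h2):
    """Jaccard similarity of the cluster 'bit-masks' of two hosts."""
    s1 = a_clusters[h1] | c_clusters[h1]
    s2 = a_clusters[h2] | c_clusters[h2]
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

for h in a_clusters:
    print(h, round(botnet_score(h), 2))
print("sim(hostA, hostB) =", round(similarity("hostA", "hostB"), 2))
# The pairwise similarities would then feed a hierarchical clustering step
# (e.g. scipy.cluster.hierarchy.linkage) to extract (sub-)botnet groups.
```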
3.4 Feature extraction
One way in which botnet detection methods differ is the way in which the features used for analysis are extracted. As depicted by Figure 5, there are two main schools of thought: passive approaches and active approaches.
Figure 5: Botnet detection approaches taxonomy: Feature extraction.
Passive approaches are the most typical ones, traditionally used in almost all botnet detection mechanisms. They detect the botnet only by observing and analyzing its activities and connections, without actively participating in its operations. Active approaches, instead, are mainly used by researchers for studying and characterizing specific botnet samples and families, to understand their behavioral and distinctive features. They are “active” because, differently from the passive ones, they involve actively participating in the botnet operations by controlling one or more active nodes belonging to (or at least pretending to belong to) the botnet. They can be divided into:
• Fast-flux tracking: Monitor and identify domains whose DNS records have low TTL values that may indicate the presence of the Fast-Flux Service Network (FFSN) mechanism typically employed by botnets; this technique has been explored by Passerini et al. [32], who proposed FluXOR, a tool aimed at identifying and characterizing FFSNs by exploiting features based on (i) the TTL of DNS resource records (FFSN ones are short-lived), (ii) the number of distinct IP addresses the domain resolves to (high for FFSNs, to provide high availability) and (iii) the heterogeneity of the organizations owning these IP addresses (high for FFSNs); these features are then combined by means of a Naïve Bayes classifier to identify FFSNs (387 in just two months) that the researchers believe to be associated with at least 16 botnets (a minimal feature-extraction sketch for this kind of analysis is given after this list). Similar results are obtained in the later work by Zhao et al. [47], proving the effectiveness of this method also on P2P-based bots using FFSNs for their malware distribution servers, like Storm10.
10 https://en.wikipedia.org/wiki/Storm_botnet
• C&C server hijack: C&C servers/botmaster peers are seized to discover information on the botnet topology and the list of involved peers. The seizure of the C&C servers/botmaster peers can be either:
– Physical, when defenders/researchers physically take over the botmaster-controlled C&C servers/peers; this is the case of Stone-Gross et al. [37], who obtained control of 16 C&C servers of the Cutwail botnet11 by getting in touch with the hosting providers these servers belonged to, showing evidence of bot infections and obtaining access to those servers, which enabled them to control the entire botnet for research purposes.
– Virtual, when defenders/researchers manage to redirect the C&C communications to a machine they control; this technique has been employed by Stone-Gross et al. [38], who were able to obtain control of the C&C server of the Torpig botnet12 by reverse engineering the Domain Generation Algorithm of the botnet and registering the future domains in advance, enabling them to control and study the botnet operations for ten days.
• Infiltration: A defender-controlled machine masquerades as a bot and probes the C&C server/P2P bot peers to progressively gain more information on the botnet protocol and on the other peers participating in the botnet. This approach has been used by Xu et al. in CyberProbe [29] and the later AutoProbe [42], which, by infiltrating the botnet operations, were able to identify previously unknown C&C servers at Internet-wide scale. They work, in the case of CyberProbe, by learning traffic fingerprints through the replay of the messages observed in known malicious traces and comparing the responses, and, in the case of AutoProbe, by sample execution, code branch exploration and trace analysis to extract a set of symbolic equations used to fingerprint the candidate C&C server responses.
• Honeypot based: Deliberately vulnerable machines are exposed on the Internet to get infected by malware samples, which are then studied by defenders/researchers to prevent the compromise of real machines; this technique has been studied and deeply researched by the Honeynet Project13 and presented in the work by Bächer et al. [33], where honeypots are safely infected by exploiting a firewall (the “honeywall”) that traps the botnet infection, controlling the outgoing traffic and blocking its communication to the outside, thus enabling the study of the sample behavior in safety.
• Sinkholing: Machines controlled by defenders/researchers are injected into the peer lists of P2P botnets with the goal of enumerating the bots and mitigating the botnet, by incrementally pushing all the bots to interact only with the sinkhole peers; this technique is typically used to seize and study P2P botnets such as ZeroAccess14, sinkholed by the security researchers at Symantec [1].
11 https://en.wikipedia.org/wiki/Cutwail_botnet
12 https://en.wikipedia.org/wiki/Torpig
13 https://www.honeynet.org
14 https://en.wikipedia.org/wiki/ZeroAccess_botnet
• DNS cache snooping: The caches of DNS servers are inspected to identify illegitimate or unexpected DNS queries, the presence of known or suspicious DNS records and how frequently a domain is queried, possibly signaling the presence of a botnet; the capabilities and possible use-cases of this technique are explored by Grangeia [10].
• Suppression: Incoming/outgoing packets of suspicious network flows are suppressed to elicit a known response from one of the ends of the C&C communication; the goal can be, for instance, to trigger the bot’s C&C back-up mechanisms by blocking the communication to the primary C&C server, as proposed by Neugschwandtner et al. [30], who developed a tool called SQUEEZE, capable of detecting the back-up C&C servers of malware through a dynamic execution based on multipath exploration: by means of virtual machine snapshotting, the malware is reverted to its state at the moment a connection is attempted, so that both the branch in which the connection is allowed and the one in which it is blocked can be explored.
• Injection: C&C communications are identified by injecting packets into suspicious network flows and checking the similarity of the responses to the injected packets with known bot responses; this technique is leveraged by BotProbe, proposed by Gu et al. [14], to separate human chat-like IRC communications from bot IRC communications using a set of four hypothesis tests: (i) a challenge-based Turing-Test-Hypothesis, (ii) a Single-Binary-Response-Hypothesis test to check whether a client response is observed, (iii) a Correlation-Response-Hypothesis test to check the homogeneity of the responses to the same message and (iv) an Interleaved-Binary-Response-Hypothesis test to check changes in the responses when modified variants of the same message are sent.
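As anticipated in the fast-flux tracking item above, the sketch below shows how FluXOR-style features could be computed from a set of DNS observations for a domain. It is only an illustration under stated assumptions: the observations are hard-coded tuples of (TTL, resolved IP, owning organization), the thresholds are invented, and the final combination is a simple rule rather than the Naïve Bayes classifier actually used in [32].

```python
# Hypothetical DNS observations for one domain, collected over several lookups:
# (record TTL in seconds, resolved IP address, organization owning that IP).
observations = [
    (120, "203.0.113.10", "ISP-A"),
    (120, "198.51.100.7", "ISP-B"),
    (180, "192.0.2.55",   "ISP-C"),
    (120, "203.0.113.99", "ISP-A"),
]

def ffsn_features(obs):
    """FluXOR-style features: short TTLs, many distinct IPs, many owners."""
    avg_ttl = sum(ttl for ttl, _, _ in obs) / len(obs)
    distinct_ips = len({ip for _, ip, _ in obs})
    distinct_orgs = len({org for _, _, org in obs})
    return avg_ttl, distinct_ips, distinct_orgs

def looks_like_ffsn(obs, ttl_max=300, min_ips=3, min_orgs=2):
    """Toy decision rule (thresholds are assumptions, not FluXOR's model)."""
    avg_ttl, distinct_ips, distinct_orgs = ffsn_features(obs)
    return avg_ttl <= ttl_max and distinct_ips >= min_ips and distinct_orgs >= min_orgs

print(ffsn_features(observations))   # (135.0, 4, 3)
print(looks_like_ffsn(observations)) # True
```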
3.5 Feature correlation
Botnet detection methods also differ in the way the features collected for detection purposes are analyzed and combined to draw conclusions about possible infections. In practice, many small bits of information from potentially different kinds of sources are correlated to obtain a higher-level, summarized knowledge that is later used in the detection phase; the different ways in which features can be correlated are shown in Figure 6. There are two possible correlation schemes:
Figure 6: Botnet detection approaches taxonomy: Feature correlation.
• Vertical, correlating the sequence (or history) of activities performed by a single host, which is then compared with known models of bot behavior;
• Horizontal, correlating the activities performed by different hosts in the monitored network and detecting bots by observing similarities in the timing and kind of operations, so as to detect the coordinated activities that are typical of botnet infections.
3.5.1 Vertical correlation
The detection is triggered by correlating a chain of suspicious activities performed by a single host, signaling a bot infection at its premises. Many of the proposals in the literature employ vertical correlation schemes to derive the detection knowledge, and their variety confirms the flexibility of this correlation scheme: there are data mining based solutions [26] [22] [20], machine learning based classifiers [25] [34] [48] [16] [19], NIDS based hybrid solutions [12] [18] and more [17] [31] [7] [35] [3] [40] [43] [24] [14] [41] [29] [42] [21].
Masud et al. [26] proposed an approach applying data mining techniques to two different kinds of log files taken at a specific host: (i) a tcpdump log file, taken with WinDump15, containing the packet trace captured at the host, and (ii) an exedump log file, generated by a process tracer implemented by the researchers to track all the applications launched on the host. Traces were collected from uninfected and infected machines (virtual machines running Windows XP and infected with malware samples) to obtain a dataset from which flow-based features are extracted through data mining techniques and used to train and test machine learning classifiers. The idea is to correlate the events between the two logs to detect the cause-effect events that follow command submission and are typical of bot infections, namely observable commands: (i) bot-response, when the command solicits a response from the infected host, (ii) bot-app, when the command causes an application to start, and (iii) bot-other, when the command causes the infected host to contact some other infected host. Upon these observable commands, statistical features are built by correlating the events between the two logs, and classifiers are derived to detect future bot infections by mining real-time logs to extract the features that feed the classification task.
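A minimal sketch of the two-log correlation idea is shown below, under simplifying assumptions: the packet and process events are hard-coded (timestamp, description) tuples rather than real tcpdump/exedump parsers, and a fixed time window of a few seconds is used to pair an incoming packet with a subsequent application launch, flagging a bot-app style observable command.

```python
# Hypothetical, pre-parsed events: (timestamp in seconds, description).
packet_events = [(10.0, "inbound TCP from 198.51.100.7:6667"),
                 (42.5, "inbound TCP from 198.51.100.7:6667")]
process_events = [(11.2, "started ftp.exe"),
                  (300.0, "started notepad.exe")]

def correlate(packets, processes, window=5.0):
    """Pair each inbound packet with process launches that follow it within
    `window` seconds, yielding candidate 'bot-app' observable commands."""
    matches = []
    for p_time, p_desc in packets:
        for a_time, a_desc in processes:
            if 0.0 <= a_time - p_time <= window:
                matches.append((p_desc, a_desc, a_time - p_time))
    return matches

for packet, app, delay in correlate(packet_events, process_events):
    print(f"possible bot-app command: {packet} -> {app} after {delay:.1f}s")
```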
3.5.2 Horizontal correlation
In this case, the correlation does not focus on a specific host, but rather exploits the fact that bot infections are characterized by the presence of coordinated activities performed by multiple hosts sharing the same infection. This is due to the fact that commands are issued on a coarse-grain basis, typically involving entire portions of the overall botnet: it is therefore very likely, for multiple infections of the same malware, to exhibit coordination in performing some activities (e.g. querying a specific domain, DoS-attacking a specific host, etc.). This kind of correlation can be more effective in detecting infections, but it fails when there are few infections, or a single one, in the monitored network. This is why horizontal schemes are typically paired with some additional analysis mechanism, such as a vertical or a signature based one. In the literature we can identify different solutions built upon horizontal correlation schemes: some of them employ group behavior analysis [11] [13] [4] [35], graph based approaches [5] [8] [27], clustering based classification [3] [23] [46] [45] [31] and more [39] [44] [36].
15 https://www.winpcap.org/windump/
Coskun et al. [5] proposed a detection method that aims to detect P2P botnets by exploiting the fact that their members are very likely to talk to at least one common external bot in a given time window. In other words, depending on the size of the botnet, there is a significant probability for two peers belonging to the same P2P botnet to have at least one mutual contact in a given time frame. Using this information, a mutual-contact graph is built, where there is an edge between each pair of nodes having at least one mutual contact, with a weight representing the cardinality of the mutual-contact set of those two nodes. To avoid false positives (e.g. connections to a very popular external server, like google.com), only private mutual contacts, communicating with fewer than a privacy threshold k hosts, are considered among the possible mutual contacts of two nodes. To compute which hosts are likely to belong to a P2P botnet, an algorithm (the “Dye-Pumping” algorithm) is run on top of the graph, computing the confidence level of each host being part of the same P2P botnet as the seed node (i.e. a node that has been observed performing malicious tasks or for which there is evidence of bot infection), from which the algorithm is started. The algorithm starts by assigning to the seed the dye it can move to each of its neighbor nodes: for each edge, the assigned dye is the ratio between the number of mutual contacts of those nodes and the degree of the seed node, raised to a Node Degree Sensitivity Coefficient to balance the dye assigned to high-degree nodes. The algorithm goes on for a fixed maximum number of iterations, pumping the available dye to the neighbors in proportion to the weights of the edges. When the algorithm terminates, the nodes having an amount of dye (representing the confidence level of that node being part of the botnet) greater than a threshold are detected as P2P bots, enabling the identification also of dormant bots that have not yet revealed their malicious nature.
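The sketch below illustrates the dye-propagation idea on a toy mutual-contact graph. It is a simplified reading of the description above, not the authors' exact algorithm: edge weights are the sizes of the mutual-contact sets, dye is pumped from each node to its neighbors in proportion to edge weights for a fixed number of iterations, and the degree-sensitivity exponent and the detection threshold are assumed values.

```python
# Toy mutual-contact graph: edge (u, v) -> number of private mutual contacts.
edges = {("seed", "h1"): 4, ("seed", "h2"): 1, ("h1", "h2"): 3, ("h2", "h3"): 2}

def neighbors(node):
    for (u, v), w in edges.items():
        if u == node:
            yield v, w
        elif v == node:
            yield u, w

def dye_pumping(seed, iterations=3, degree_exp=1.0, threshold=0.2):
    """Propagate 'dye' from the seed along mutual-contact edges.

    Each round, every node pumps its current dye to its neighbors in
    proportion to the edge weights (damped by the node degree raised to
    `degree_exp`); nodes ending with dye above `threshold` are flagged.
    """
    dye = {seed: 1.0}
    for _ in range(iterations):
        new_dye = dict(dye)
        for node, amount in dye.items():
            nbrs = list(neighbors(node))
            total_w = sum(w for _, w in nbrs) * (len(nbrs) ** degree_exp)
            for nbr, w in nbrs:
                new_dye[nbr] = new_dye.get(nbr, 0.0) + amount * w / total_w
        dye = new_dye
    return {n: d for n, d in dye.items() if n != seed and d >= threshold}

print(dye_pumping("seed"))  # hosts whose accumulated dye exceeds the threshold
```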
3.6 Detection method
Finally, botnet detection solutions can be characterized in terms of the algorithms that analyze the input data and derive conclusions, that is, how the detection and identification of bots is carried out. As Figure 7 shows, the algorithms can be divided into two groups, borrowing mainly from the fields of machine learning and data mining, whose performance has proven to be very high in many areas of data analysis. These are:
• Statistical classification, where observations are assigned the proper class label, inferred by a classification algorithm; the classification procedure can be:
– Threshold based, where an anomaly score is computed for the observations and then compared against a threshold, tuned on the collected observations, to infer the right class label;
– Supervised learning based, where a supervised learning classifier is trained on the collected data to infer the right class label for future observations.
• Cluster analysis, where observations are grouped into clusters of similar objects according to a specific similarity measure.
Figure 7: Botnet detection approaches taxonomy: Detection algorithm.
3.6.1 Statistical classification
These methods solve the problem of identifying which class each observation belongs to, on the basis of the knowledge learnt from a labeled training dataset, that is, essentially, a set of observations with a known and verified classification. The goal is obtaining a classification algorithm or function capable of identifying bot observations with high accuracy. There are in general two possibilities: Threshold based classifiers and Supervised learning based classifiers, the latter borrowing from the field of machine learning.
Threshold based: The principle behind this class of methods is very simple: classify as malicious all the observations that look “too” suspicious. What differentiates the methods falling in this category is the way in which the suspiciousness is computed and the point at which a suspicious observation is labeled as a bot one: namely, they differ in terms of the anomaly score calculation and of the threshold tuning. The anomaly score can be computed on the basis of anomalies in the generated DNS traffic [4] [43] [35], of the timing and content of the responses signaling automated bot activity [14] [29] [42], of the quantity and kind of alarms generated by network monitoring software [12] [18], and more [24] [41] [21] [5] [7]. As for the threshold, in most of the cases it is decided experimentally [9] [5] [4] [43], or at least the weights for the score are decided in this way [9] [12]; alternatively, it can be computed algorithmically by employing techniques such as the Threshold Random Walk (TRW) [14].
Examples of threshold based techniques are Yadav et al. [43], employing an experimentally computed threshold for anomalies in the DNS traffic, Rishi by Goebel et al. [9], computing a weighted anomaly score to identify anomalous IRC traffic (both described in section 3.1.1), BotHunter by Gu et al. [12], computing a weighted anomaly score for the events generated by the Snort IDS and generating profiles of bot infections for all the chains of events exceeding the threshold (described in 3.3.2), and further techniques described in the previous sections [5] [24] [21].
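To make the threshold-based idea concrete, the fragment below computes a weighted anomaly score over a few boolean indicators and compares it with an experimentally tuned threshold. The indicators, weights and threshold are purely illustrative assumptions in the spirit of Rishi- or BotHunter-style scoring, not values taken from those systems.

```python
# Illustrative per-host indicators extracted by upstream analysis modules.
indicators = {
    "suspicious_irc_nickname": True,   # e.g. nickname matching bot-like patterns
    "high_scan_rate": False,           # many failed outbound connections
    "dns_nxdomain_burst": True,        # burst of NXDOMAIN responses
    "spam_smtp_activity": False,       # outbound SMTP to many destinations
}

# Assumed weights and threshold, tuned experimentally on labeled traffic.
weights = {
    "suspicious_irc_nickname": 3.0,
    "high_scan_rate": 2.0,
    "dns_nxdomain_burst": 1.5,
    "spam_smtp_activity": 2.5,
}
THRESHOLD = 4.0

score = sum(weights[name] for name, fired in indicators.items() if fired)
label = "bot" if score >= THRESHOLD else "benign"
print(f"anomaly score = {score:.1f} -> {label}")  # 4.5 -> bot
```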
Supervised learning based: Following machine learning terminology, we label as supervised learning classifiers all those classifiers whose knowledge is extracted from an already labeled training set, considered as the ground truth upon which the classifier is built. In this case, the composition of the dataset is critical: to prevent the resulting classifier from underfitting (i.e. not capturing the underlying trend of the data) or overfitting (i.e. describing the random error or the noise instead of the underlying relationship), quantity and variety are essential properties of the training dataset. Also, a part of the dataset has to be reserved for testing the classifier: the classifier must be able to cope with unseen behavior, which is possible only for classifiers that are neither overfitted nor underfitted, so testing is a critical step for the evaluation of the classifier’s performance. With careful tuning, supervised learning classification enables both very high detection rates and very low false positive and false negative rates, which explains the growing interest in these detection methods. Since there are many different possibilities, it is very common to train multiple different classifiers and then choose the one showing the most promising results. There are works evaluating Naïve Bayes and J48 decision trees [39] [25] [22], Random Forests [36], Boosted Decision Trees and Support Vector Machines (SVM) [20], Artificial Neural Networks [19], and more [48] [16] [26] [34].
Kirubavathi et al. [20] proposed a botnet detection method based on the mining of traffic flow characteristics, which are then used to train machine learning classifiers achieving high precision and recall rates. The key factor behind the performance of the proposed method is the quality of the features mined from the network flows: by extracting only 4 features, it is capable of precisely modeling the most relevant characteristics of botnet traffic to identify bot-infected hosts. Those features are: (i) Small packets (Ps), capturing the fact that automated bot behavior is characterized by the exchange of many small packets (in the range of 40-320 bytes) due to activities such as scanning, information harvesting and maintenance of the C&C network (especially in the case of P2P botnets); (ii) Packet ratio (Pr), capturing the fact that pre-programmed bots can perform only a limited number of (pre-programmed) activities, thus limiting the observed behavior and hence the possible fluctuations in the traffic they generate (differently from normal traffic, which fluctuates much more); this is captured by the ratio between the number of incoming and outgoing packets in a given time window; (iii) Initial packet length (Pl), which captures the fact that the initial exchange of packets follows a well defined behavior depending on the (botnet) communication protocol, containing information that can be successfully used to classify the flow as malicious or benign; (iv) Bot Response Packet (BRp), which captures the fact that, for automated bots, when an incoming packet originating from some specific IP-port combination is received, a response to the same IP-port destination is observed with a typically constant response time. Out of these four features, a flow vector is constructed for each time window from their values in that time window (N.B. the initial packet length is calculated only once, when the flow is established, and then carried over to future time windows); the window duration is a critical parameter representing the trade-off between accuracy and timeliness of the detection and thus must be carefully tuned through extensive evaluation. The first achievement of this work is showing that it is not the number of features that makes the difference, but rather their quality; in addition, a small number of features enables faster and more efficient detection. Three machine learning classifiers are trained and 10-fold cross-validated with datasets containing the flow vectors of benign (FTP services, Web services, BitTorrent applications and more) and malicious (Zeus, SpyEye, BlackEnergy botnets and many more) traces created by the researchers in a testbed environment or obtained by leveraging public datasets (e.g. the ISOT botnet dataset16), assembling three vast and varied datasets to train and test the classifiers. The machine learning classifiers used were (i) Boosted Decision Tree, using the AdaBoost ensemble algorithm to construct a strong classifier out of a weighted linear combination of weak classifiers, (ii) Naïve Bayes classifier, a probabilistic classifier based on Bayes’ theorem analyzing the relationship between the features and the corresponding class to derive the likelihood (a conditional probability) linking the features to the classes, and (iii) Support Vector Machine, a widely used classification technique transforming the dataset into a higher dimension where the data instances of the (typically two) classes can be divided into different spaces through a separating hyperplane maximizing the margin between the two closest data points of the different spaces. High precision and recall rates are achieved by all three classifiers on all three datasets, with very high time efficiency, and with the Naïve Bayes classifier showing the best performance both in detection and in efficiency.
16 http://www.uvic.ca/engineering/ece/isot/datasets/
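The sketch below shows how classifiers of the three families mentioned above could be trained and 10-fold cross-validated on flow vectors of the form (Ps, Pr, Pl, BRp) using scikit-learn. It is not the authors' code: the tiny in-memory dataset is invented for illustration, and any real evaluation would use properly labeled traces such as those described in [20].

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical flow vectors: [Ps, Pr, Pl, BRp] per host/time-window,
# with label 1 = bot flow, 0 = benign flow (all values are invented).
X = np.array([
    [0.9, 1.0, 90, 0.8], [0.8, 0.9, 85, 0.9], [0.9, 1.1, 88, 0.7],
    [0.2, 0.4, 600, 0.1], [0.3, 0.5, 550, 0.0], [0.1, 0.3, 700, 0.2],
] * 5)  # replicate to have enough samples for 10-fold cross-validation
y = np.array([1, 1, 1, 0, 0, 0] * 5)

classifiers = {
    "Boosted Decision Tree": AdaBoostClassifier(n_estimators=50),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

In practice the flow vectors would be computed per time window as described above, and the window length tuned alongside the classifiers.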
3.6.2 Cluster analysis
These methods, differently from statistical classification, which tries to identify the class a particular observation belongs to, solve the problem of grouping observations into clusters in such a way that the similarity with the other observations in the same cluster is greater than the similarity with those belonging to other clusters. In other words, the typical goal of cluster analysis is to identify the clusters so that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized, identifying dense and well defined groups of objects. In machine learning terminology, cluster analysis is considered an instance of unsupervised learning, since it works on unlabeled data with no classification included in the observations; thus, in this case, no evaluation of the accuracy of the structure output by the algorithm is performed. Parameters of the clustering procedure are the similarity (or distance) function and a density threshold or the desired number of clusters, which can also be computed iteratively by incrementally adjusting the number of clusters and picking the value that gives the highest-quality clusters. The similarity metric used in the clustering process is probably the feature that most characterizes the different methods proposed in the literature; there are cluster analysis methods clustering observations simply on the basis of the flow 5-dimensional tuple (IP_S, Port_S, IP_D, Port_D, Protocol) and then performing further analysis on the identified flows [17] [36], on the basis of computed traffic flow characteristics [39] [3] [40], on the basis of C&C communications and/or performed malicious activities [13] [23], on graph structural characteristics [8] and more [44] [46] [45]. Many of the methods presented in the previous sections are based on cluster analysis; examples are TĀMD, proposed by Yen et al. [44], presented in 3.1.1 and clustering on the basis of statistical features extracted from the flows, BotMiner, proposed by Gu et al. [13], presented in 3.3.2 and clustering observations on two independent planes on the basis of C&C communication (C-plane) and performed malicious activities (A-plane) to later perform cross-cluster correlation through hierarchical clustering, and the work by Zhang et al. [46] [45], presented in 3.1.2, applying multiple steps of coarse-grained and fine-grained clustering, first to identify P2P hosts and then P2P bots, by finding fingerprint clusters characterizing bot behavior and measuring their similarity to identify bot families.
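As an illustration of the BotMiner-style two-step flow clustering mentioned above, the snippet below first clusters flow vectors on a reduced set of dimensions and then re-clusters each coarse cluster using all dimensions with scikit-learn's KMeans. The flow vectors, the choice of KMeans and the cluster counts are assumptions made for the sketch, not the configuration used in [13].

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical per-host C-flow feature vectors:
# [flows/hour, packets/flow, bytes/packet, bytes/second]
flows = np.vstack([
    rng.normal(loc=[5, 10, 100, 50], scale=1.0, size=(20, 4)),   # one traffic profile
    rng.normal(loc=[50, 3, 60, 500], scale=1.0, size=(20, 4)),   # another profile
])

# Step 1: coarse clustering on a reduced set of dimensions (first two features).
step1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(flows[:, :2])

# Step 2: refine each coarse cluster using all dimensions.
refined_labels = np.empty(len(flows), dtype=int)
next_label = 0
for c in np.unique(step1.labels_):
    idx = np.where(step1.labels_ == c)[0]
    sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(flows[idx])
    refined_labels[idx] = sub.labels_ + next_label
    next_label += 2

print("coarse clusters:", np.bincount(step1.labels_))
print("refined clusters:", np.bincount(refined_labels))
```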
4 Research directions and conclusions
The detection methods currently proposed in the literature show very high performance in terms of detection rates and false positive/negative rates, meaning that the current techniques, mostly leveraging data mining and machine learning algorithms combined with the more traditional anomaly and signature based approaches, offer a well-tested and solid groundwork for future botnet detection approaches. However, Internet of Things (IoT), Machine-to-Machine (M2M) and similar applications will introduce new challenging tasks from the scalability point of view: the more the devices, the more traffic is generated. The design of scalable and distributed solutions will therefore be the future challenge that researchers in the field of botnet detection will have to overcome. Some researchers have already focused on scalability, mostly in relation to P2P botnet detection [46] [45] (described in section 3.1.2), by employing an efficient combination of coarse-grained and fine-grained clustering steps, reducing the high computational costs typical of cluster analysis based solutions and enabling the detection of stealthy P2P botnets, even when blended within benign P2P traffic. Even more benefits can be obtained by adopting distributed computation, at least in the steps requiring the greatest computational effort; for example, [45] improves the performance of the prototype work [46] by performing a two-step clustering for the cost-expensive task of fine-grained P2P host detection, dividing the set of hosts (whose flows can be analyzed individually) across many computational nodes to identify step-1 clusters, whose results are aggregated using a cumulative function. One of the most representative works showing the power of distributed computing in the field of botnet detection is the one proposed by Singh et al. [36], employing the Hadoop implementation of the MapReduce paradigm both for feature extraction, by submitting traffic traces to the Hadoop Distributed File System (HDFS) and letting Apache Hive17 extract the flow-based features using the flow 5-dimensional tuple (IP_S, Port_S, IP_D, Port_D, Protocol) as key for traffic flow aggregation, and for classification, by employing Apache Mahout18, the machine learning library built on top of Hadoop, to train a Random Forest classifier, achieving high true positive rates (above 99%) and low false positive rates (below 0.3%). Concluding, the new challenges introduced by the newest paradigms will require the design of scalable, distributed and performance-efficient solutions, leveraging the integration with well-tested and well-known parallel and distributed programming frameworks.
17 https://hive.apache.org
18 http://mahout.apache.org
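To illustrate the feature-extraction step of such distributed pipelines, the fragment below performs a MapReduce-style aggregation of packet records keyed by the flow 5-tuple in plain Python. It is only a local, single-process sketch with invented records; in [36] the equivalent step is carried out by Apache Hive over HDFS, and the classification by Apache Mahout.

```python
from collections import defaultdict

# Hypothetical packet records: (src_ip, src_port, dst_ip, dst_port, proto, bytes).
packets = [
    ("10.0.0.5", 4444, "198.51.100.7", 80, "TCP", 120),
    ("10.0.0.5", 4444, "198.51.100.7", 80, "TCP", 60),
    ("10.0.0.9", 5353, "203.0.113.2", 53, "UDP", 75),
]

# "Map": emit (5-tuple key, packet size); "Reduce": aggregate per flow.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
for src, sport, dst, dport, proto, size in packets:
    key = (src, sport, dst, dport, proto)
    flows[key]["packets"] += 1
    flows[key]["bytes"] += size

for key, feats in flows.items():
    print(key, feats)  # per-flow feature vectors fed to a classifier downstream
```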
References
[1] R. Gibb and A. Neville. ZeroAccess Indepth. Tech. rep. Symantec Security Response,
2013. url: http://www.symantec.com/content/en/us/enterprise/media/
security_response/whitepapers/zeroaccess_indepth.pdf.
[2] M. Casenove and A. Miraglia. “Botnet over Tor: The illusion of hiding.” In: 2014
6th International Conference On Cyber Conflict (CyCon 2014). Vrije Universiteit:
NATO CCD COE, 2014, pp. 273–282. url: https://ccdcoe.org/cycon/2014/
proceedings/d3r2s3_casenove.pdf.
[3] Su Chang and Thomas E. Daniels. “P2P Botnet Detection Using Behavior Clus-
tering Statistical Tests”. In: Proceedings of the 2Nd ACM Workshop on Security
and Artificial Intelligence. AISec ’09. Chicago, Illinois, USA: ACM, 2009, pp. 23–30.
isbn: 978-1-60558-781-3. doi: 10.1145/1654988.1654996. url: http://doi.acm.
org/10.1145/1654988.1654996.
[4] H. Choi et al. “Botnet Detection by Monitoring Group Activities in DNS Traffic”.
In: 7th IEEE International Conference on Computer and Information Technology
(CIT 2007). Oct. 2007, pp. 715–720. doi: 10.1109/CIT.2007.90.
[5] Baris Coskun, Sven Dietrich, and Nasir Memon. “Friends of an Enemy: Identifying
Local Members of Peer-to-peer Botnets Using Mutual Contacts”. In: Proceedings of
the 26th Annual Computer Security Applications Conference. ACSAC ’10. Austin,
Texas, USA: ACM, 2010, pp. 131–140. isbn: 978-1-4503-0133-6. doi: 10.1145/
1920261.1920283. url: http://doi.acm.org/10.1145/1920261.1920283.
[6] Christian J. Dietrich et al. “On Botnets That Use DNS for Command and Control”.
In: Proceedings of the 2011 Seventh European Conference on Computer Network
Defense. EC2ND ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 9–
16. isbn: 978-0-7695-4762-6. doi: 10.1109/EC2ND.2011.16. url: http://dx.doi.
org/10.1109/EC2ND.2011.16.
[7] M. Eslahi, H. Hashim, and N. M. Tahir. “An efficient false alarm reduction ap-
proach in HTTP-based botnet detection”. In: 2013 IEEE Symposium on Computers
Informatics (ISCI). Apr. 2013, pp. 201–205. doi: 10.1109/ISCI.2013.6612403.
[8] Jérôme François et al. “BotTrack: Tracking Botnets Using NetFlow and PageRank”.
In: Proceedings of the 10th International IFIP TC 6 Conference on Networking - Vol-
ume Part I. NETWORKING’11. Valencia, Spain: Springer-Verlag, 2011, pp. 1–14.
isbn: 978-3-642-20756-3. url: http://dl.acm.org/citation.cfm?id=2008780.
2008782.
[9] Jan Goebel and Thorsten Holz. “Rishi: Identify Bot Contaminated Hosts by IRC
Nickname Evaluation”. In: Proceedings of the First Conference on First Workshop
on Hot Topics in Understanding Botnets. HotBots’07. Cambridge, MA: USENIX
Association, 2007, pp. 8–8. url: http://dl.acm.org/citation.cfm?id=1323128.
1323136.
[10] L. Grangeia. “DNS Cache Snooping or Snooping the Cache for Fun and Profit”. In:
UNC Computer Science (2014).
[11] Guofei Gu, Junjie Zhang, and Wenke Lee. “BotSniffer: Detecting Botnet Command
and Control Channels in Network Traffic.” In: NDSS. The Internet Society, 2008.
url: http://dblp.uni-trier.de/db/conf/ndss/ndss2008.html#GuZL08.
[12] Guofei Gu et al. “BotHunter: Detecting Malware Infection Through IDS-driven Dia-
log Correlation”. In: Proceedings of 16th USENIX Security Symposium on USENIX
Security Symposium. SS’07. Boston, MA: USENIX Association, 2007, 12:1–12:16.
isbn: 111-333-5555-77-9. url: http://dl.acm.org/citation.cfm?id=1362903.
1362915.
[13] Guofei Gu et al. “BotMiner: Clustering Analysis of Network Traffic for Protocol- and
Structure-independent Botnet Detection”. In: Proceedings of the 17th Conference on
Security Symposium. SS’08. San Jose, CA: USENIX Association, 2008, pp. 139–154.
url: http://dl.acm.org/citation.cfm?id=1496711.1496721.
[14] G. Gu et al. “Active Botnet Probing to Identify Obscure Command and Control
Channels”. In: 2009 Annual Computer Security Applications Conference. Dec. 2009,
pp. 241–253. doi: 10.1109/ACSAC.2009.30.
[15] C. Guarnieri. Skynet, a Tor-powered botnet straight from Reddit. Ed. by Rapid7.
2012. url: https://community.rapid7.com/community/infosec/blog/2012/12/
06/skynet-a-tor-powered-botnet-straight-from-reddit.
[16] F. Haddadi and A. N. Zincir-Heywood. “Benchmarking the Effect of Flow Exporters
and Protocol Filters on Botnet Traffic Classification”. In: IEEE Systems Journal 10.4
(Dec. 2016), pp. 1390–1401. issn: 1932-8184. doi: 10.1109/JSYST.2014.2364743.
[17] Anestis Karasaridis, Brian Rexroad, and David Hoeflin. “Wide-scale Botnet De-
tection and Characterization”. In: Proceedings of the First Conference on First
Workshop on Hot Topics in Understanding Botnets. HotBots’07. Cambridge, MA:
USENIX Association, 2007, pp. 7–7. url: http://dl.acm.org/citation.cfm?id=
1323128.1323135.
[18] Sheharbano Khattak et al. “BotFlex: A community-driven tool for botnet detection”.
In: Journal of Network and Computer Applications 58 (2015), pp. 144–154. issn:
1084-8045. doi: http://dx.doi.org/10.1016/j.jnca.2015.10.002. url:
http://www.sciencedirect.com/science/article/pii/S1084804515002155.
[19] G. Kirubavathi Venkatesh and R. Anitha Nadarajan. “HTTP Botnet Detection Us-
ing Adaptive Learning Rate Multilayer Feed-forward Neural Network”. In: Proceed-
ings of the 6th IFIP WG 11.2 International Conference on Information Security
Theory and Practice: Security, Privacy and Trust in Computing Systems and Ambi-
ent Intelligent Ecosystems. WISTP’12. Egham, UK: Springer-Verlag, 2012, pp. 38–
48. isbn: 978-3-642-30954-0. doi: 10.1007/978-3-642-30955-7_5. url: http:
//dx.doi.org/10.1007/978-3-642-30955-7_5.
[20] G. Kirubavathi and R. Anitha. “Botnet detection via mining of traffic flow charac-
teristics”. In: Computers Electrical Engineering 50 (2016), pp. 91–101. issn: 0045-
7906. doi: http://dx.doi.org/10.1016/j.compeleceng.2016.01.012. url:
http://www.sciencedirect.com/science/article/pii/S0045790616000148.
[21] C. N. Lee, F. Chou, and C. M. Chen. “Automatically Generating Payload-Based
Models for Botnet Detection”. In: 2015 IEEE International Conference on Smart
City/SocialCom/SustainCom (SmartCity). Dec. 2015, pp. 1038–1044. doi: 10.1109/
SmartCity.2015.206.
[22] W. H. Liao and C. C. Chang. “Peer to Peer Botnet Detection Using Data Mining
Scheme”. In: 2010 International Conference on Internet Technology and Applica-
tions. Aug. 2010, pp. 1–4. doi: 10.1109/ITAPP.2010.5566407.
[23] Dan Liu et al. “A P2P-Botnet detection model and algorithms based on network
streams analysis”. In: 2010 International Conference on Future Information Tech-
nology and Management Engineering. Vol. 1. Oct. 2010, pp. 55–58. doi: 10.1109/
FITME.2010.5655788.
[24] Lei Liu et al. “BotTracer: Execution-Based Bot-Like Malware Detection”. In: Pro-
ceedings of the 11th International Conference on Information Security. ISC ’08.
Taipei, Taiwan: Springer-Verlag, 2008, pp. 97–113. isbn: 978-3-540-85884-3. doi:
10.1007/978-3-540-85886-7_7. url: http://dx.doi.org/10.1007/978-3-540-
85886-7_7.
[25] C. Livadas et al. “Using Machine Learning Techniques to Identify Botnet Traffic”.
In: Proceedings. 2006 31st IEEE Conference on Local Computer Networks. Nov.
2006, pp. 967–974. doi: 10.1109/LCN.2006.322210.
[26] M. M. Masud et al. “Flow-based identification of botnet traffic by mining multiple
log files”. In: 2008 First International Conference on Distributed Framework and
Applications. Oct. 2008, pp. 200–206. doi: 10.1109/ICDFMA.2008.4784437.
[27] Shishir Nagaraja et al. “BotGrep: Finding P2P Bots with Structured Graph Anal-
ysis”. In: Proceedings of the 19th USENIX Conference on Security. USENIX Se-
curity’10. Washington, DC: USENIX Association, 2010, pp. 7–7. isbn: 888-7-6666-
5555-4. url: http://dl.acm.org/citation.cfm?id=1929820.1929830.
[28] Shishir Nagaraja et al. “Stegobot: A Covert Social Network Botnet”. In: Proceedings
of the 13th International Conference on Information Hiding. IH’11. Prague, Czech
Republic: Springer-Verlag, 2011, pp. 299–313. isbn: 978-3-642-24177-2. url: http:
//dl.acm.org/citation.cfm?id=2042445.2042473.
[29] Antonio Nappa et al. “CyberProbe: Towards Internet-Scale Active Detection of Ma-
licious Servers.” In: NDSS. The Internet Society, 2014. url: http://dblp.uni-
trier.de/db/conf/ndss/ndss2014.html#NappaXRCG14.
[30] Matthias Neugschwandtner, Paolo Milani Comparetti, and Christian Platzer. “De-
tecting Malware’s Failover C&C Strategies with Squeeze”. In: Proceedings of
the 27th Annual Computer Security Applications Conference. ACSAC ’11. Orlando,
Florida, USA: ACM, 2011, pp. 21–30. isbn: 978-1-4503-0672-0. doi: 10 . 1145 /
2076732.2076736. url: http://doi.acm.org/10.1145/2076732.2076736.
[31] S. K. Noh et al. “Detecting P2P Botnets Using a Multi-phased Flow Model”. In:
2009 Third International Conference on Digital Society. Feb. 2009, pp. 247–253. doi:
10.1109/ICDS.2009.37.
[32] Emanuele Passerini et al. “FluXOR: Detecting and Monitoring Fast-Flux Service
Networks”. In: Detection of Intrusions and Malware, and Vulnerability Assessment:
5th International Conference, DIMVA 2008, Paris, France, July 10-11, 2008. Pro-
ceedings. Ed. by Diego Zamboni. Berlin, Heidelberg: Springer Berlin Heidelberg,
2008, pp. 186–206. isbn: 978-3-540-70542-0. doi: 10.1007/978-3-540-70542-0_10.
url: http://dx.doi.org/10.1007/978-3-540-70542-0_10.
[33] Honeynet Project. Know your Enemy: Tracking Botnets. Published on the Web.
url: http://www.honeynet.org/papers/bots/.
[34] S. Saad et al. “Detecting P2P botnets through network behavior analysis and ma-
chine learning”. In: 2011 Ninth Annual International Conference on Privacy, Secu-
rity and Trust. July 2011, pp. 174–180. doi: 10.1109/PST.2011.5971980.
[35] Reza Sharifnya and Mahdi Abadi. “DFBotKiller: Domain-flux botnet detection
based on the history of group activities and failures in DNS traffic”. In: Digital
Investigation 12 (2015), pp. 15–26. issn: 1742-2876. doi: http://dx.doi.org/10.
1016/j.diin.2014.11.001. url: http://www.sciencedirect.com/science/
article/pii/S1742287614001182.
[36] Kamaldeep Singh et al. “Big Data Analytics framework for Peer-to-Peer Botnet
detection using Random Forests”. In: Information Sciences 278 (2014), pp. 488–497.
issn: 0020-0255. doi: http://dx.doi.org/10.1016/j.ins.2014.03.066. url:
http://www.sciencedirect.com/science/article/pii/S0020025514003570.
[37] Brett Stone-Gross et al. “The Underground Economy of Spam: A Botmaster’s
Perspective of Coordinating Large-scale Spam Campaigns”. In: Proceedings of the
4th USENIX Conference on Large-scale Exploits and Emergent Threats. LEET’11.
Boston, MA: USENIX Association, 2011, pp. 4–4. url: http://dl.acm.org/
citation.cfm?id=1972441.1972447.
[38] Brett Stone-Gross et al. “Your Botnet is My Botnet: Analysis of a Botnet Takeover”.
In: Proceedings of the 16th ACM Conference on Computer and Communications
Security. CCS ’09. Chicago, Illinois, USA: ACM, 2009, pp. 635–647. isbn: 978-1-
60558-894-0. doi: 10.1145/1653662.1653738. url: http://doi.acm.org/10.
1145/1653662.1653738.
[39] W. T. Strayer et al. “Detecting Botnets with Tight Command and Control”. In:
Proceedings. 2006 31st IEEE Conference on Local Computer Networks. Nov. 2006,
pp. 195–202. doi: 10.1109/LCN.2006.322100.
[40] Florian Tegeler et al. “BotFinder: Finding Bots in Network Traffic Without Deep
Packet Inspection”. In: Proceedings of the 8th International Conference on Emerging
Networking Experiments and Technologies. CoNEXT ’12. Nice, France: ACM, 2012,
pp. 349–360. isbn: 978-1-4503-1775-7. doi: 10.1145/2413176.2413217. url: http:
//doi.acm.org/10.1145/2413176.2413217.
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 

Recently uploaded (20)

complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 

A taxonomy of botnet detection approaches

failure in the botnet functioning. It is also very common to use public key cryptography to secure the data relayed in the peer-to-peer network and to authenticate the commander hosts, which are themselves part of the peer-to-peer network along with their zombie counterparts. Each bot knows only a list of peers to which it sends the commands, which are then relayed to other peers deeper in the P2P network; this list usually includes around 256 peers, keeping it small enough to be passed to other peers, which hinders botnet takedown while allowing online bots to stay in contact.

Even though peer-to-peer based botnets are much harder to disrupt, they are not invulnerable to attacks or disruption. Two common techniques to face P2P botnets are crawling and sinkholing. With crawling it is possible to enumerate all or most of the bots in the network; once the bots have been enumerated, sinkholing can be used to achieve disruption. It relies on the typical peer-list flooding technique used by P2P botnets to achieve full coverage and works by injecting into the peer-lists of all the bots of the network fake nodes that may be either controlled by defenders or nonexistent, making the bots point to a "black hole" and modifying the structure of the network, turning it into a centralized system that can be easily taken down.

1.2 General communication techniques

Another distinguishing feature of the C2C is the communication technique used to receive commands and to send data. The communication channel plays an important role in the malware persistence within the system: this is why newer malware often uses very "creative" solutions to pass unnoticed, exploiting covert channels to avoid detection.
1.2.1 Domains (HTTP based solutions)

Using domains as C2C servers was one of the first solutions adopted by botnets: well-crafted domains or websites contain all the commands for the zombie hosts, which only have to connect to them to retrieve the commands through simple HTTP requests. The main advantage of this solution is that even big botnets can be easily maintained and updated by simply updating the content of the domain or website. The biggest disadvantage is that even fault-tolerant solutions with replicated servers can be quickly taken down by governments, or may be easily targeted by denial-of-service attacks. Another disadvantage is the bandwidth consumption for the domain, which is high compared to other solutions. This is why pure domain-based solutions are no longer used by malware developers. Instead, some phases of the botnet installation and initial setup may still be based on domain servers, typically using dynamic DNS to change IP frequently and avoid being shut down.

1.2.2 IRC

Another widely adopted solution was using the IRC (Internet Relay Chat) protocol and IRC servers as C2C servers. Infected clients connected to an infected IRC server to join an IRC channel, created by the botmaster and dedicated to C2C traffic. The botmaster can then simply send IRC messages to the channel to reach all its members through message broadcast. The main advantage of IRC based solutions is the low bandwidth consumption of the IRC protocol. The disadvantages are forced simplicity and low shutdown resilience, as in the domain case. It has also been proved that keyword blacklisting is effective in blocking IRC based botnets. For these reasons, pure IRC based solutions are no longer adopted. Instead, the IRC protocol, for its low bandwidth consumption and proven simplicity, may still be used to carry the botnet's communication in combination with other solutions such as Tor and .onion domains.

1.2.3 P2P protocols

Along with the spread of P2P based architecture solutions, botnets also started using existing P2P overlay protocols as communication channel for their C2C communications. A common example is the Kad network (based on UDP), a peer-to-peer network implementing the Kademlia P2P overlay protocol. The first P2P file-sharing programs relied on such a network, using client programs supporting the Kad network implementation. Since these were very popular, especially in the 2010s, malware developers started to use the Kad network as a C2C covert channel. This is the case of Alureon^1 (aka TDSS), which according to Microsoft was one of the most active botnets in the second quarter of 2010^2 and included encrypted communications and a decentralized C2C relying on the Kad network.

^1 https://en.wikipedia.org/wiki/Alureon
^2 http://download.microsoft.com/download/8/1/B/81B3A25C-95A1-4BCD-88A4-2D3D0406CDEF/Microsoft_Security_Intelligence_Report_volume_9_Battling_Botnets_English.pdf
1.2.4 DNS

DNS has been used (and abused) by malware developers due to its potential: it allows them to create and register a set of static or dynamically generated domains (by using Domain Generation Algorithms), and to keep changing the IPs the domains resolve to, avoiding IP blacklisting and dramatically increasing the takedown effort for governments. DNS is also very valuable for another reason: since DNS queries and responses are rarely blocked by firewalls, it offers a very good way to transmit and receive information while avoiding detection. DNS covert channels started to be increasingly used by malware to transmit payload data, tunnel other application protocols and secure the DNS payload with encryption, requiring very little investment and no complex infrastructure to work. DNS requests and responses get formatted according to the DNS syntax, typically carrying a text-formatted resource record payload that may support a chunking mechanism to avoid the DNS resource record size limits (255 bytes). This mechanism has been exploited by Feederbot [6], one of the first malware using DNS covert channels.
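To make the chunking mechanism concrete, the minimal sketch below is purely illustrative: it mirrors the general DNS-tunneling idea, not Feederbot's actual encoding, and the c2.example.com channel domain is hypothetical. It shows how a payload can be split into DNS-safe query names and how an analyst observing the queries could reassemble it:

```python
import base64

CHUNK_CHARS = 30   # keep each label comfortably under the 63-byte DNS label limit

def chunk_payload(data: bytes, channel_domain: str):
    """Split a payload into DNS-safe query names, one chunk per query."""
    encoded = base64.b32encode(data).decode().rstrip("=").lower()
    chunks = [encoded[i:i + CHUNK_CHARS] for i in range(0, len(encoded), CHUNK_CHARS)]
    # Each query name: <sequence>.<chunk>.<channel domain>
    return [f"{seq}.{chunk}.{channel_domain}" for seq, chunk in enumerate(chunks)]

def reassemble(query_names):
    """Recover the payload from observed query names (inverse of chunk_payload)."""
    parts = sorted((int(q.split(".")[0]), q.split(".")[1]) for q in query_names)
    encoded = "".join(chunk for _, chunk in parts).upper()
    encoded += "=" * (-len(encoded) % 8)          # restore base32 padding
    return base64.b32decode(encoded)

if __name__ == "__main__":
    names = chunk_payload(b"exfiltrated data...", "c2.example.com")
    print(names[0])                               # e.g. "0.<encoded chunk>.c2.example.com"
    assert reassemble(names) == b"exfiltrated data..."
```

The need to keep each chunk small is exactly why tunneled traffic tends to show unusually many and unusually long queries toward a single registered domain, a property the DNS-based detection methods discussed later can exploit.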
1.2.5 Tor

Even if centralized solutions suffered from easy identification and consequent shutdown, in terms of simplicity and manageability they were the best from the botmaster's perspective. This is why during the last years [2] a new trend started to appear in the wild: Tor based botnets. Tor is an anonymous communication network based on the onion routing protocol, in which the information is sent in encrypted form through a virtual circuit of randomly chosen relay nodes (typically 3) of the Tor network, by negotiating symmetric encryption keys with them. The real advantage for malware developers is another feature: Hidden Services (HSs), which are web services published anonymously in the Tor network and reachable without the need of knowing their location. The only thing that must be known to get in touch with them is their descriptor, in the form of a .onion address. Malware developers started to exploit HSs as more robust and resilient domains to control their bots, which just need to include a Tor client to connect to the HSs. Furthermore, the HS can be located at the infected machine side, creating distributed solutions even more resilient to shutdown. This solution has been adopted by Skynet [15], the first botnet found in the wild exploiting Tor and Hidden Services as communication channel.

1.2.6 Social networks

Very recently a very promising and threatening solution started to be exploited in the wild: social network C&C. Malware developers started to be interested in social media as command and control flow for many reasons: the chat-like capabilities (recalling the old times with IRC), the fact that access to social media through HTTP and HTTPS connections is rarely blocked by firewalls and also hardly identified as malicious by network monitoring software, the possibility to use the well-tested and powerful APIs of such sites and, most of all, the extreme ease of creating fake botmaster accounts (or multiple accounts). For these reasons, the use of social networks as C2C channel is a topic of high interest at the moment and many studies on the real potential of this solution are still under evaluation. A well known example is Stegobot [28], a very stealthy botnet exploiting social networks and steganography to encode commands into images that are then posted on the social network for the other bot members to see.

1.3 Goals

Typically, there are two generic goals in the mind of malware developers: monetization and adversary defeat. This is also the reason behind the evolution of the generic malware goals: some goals used to be very profitable or effective but are not anymore, which is why they are less frequently observed in the wild. It is very typical for a malware to have multiple goals, to increase the revenue and improve reusability; this comes at the price of stealthiness: the more the activities, the easier it is for the botnet to be detected. Well-known examples of goals and activities performed by botnets are:

• E-mail spam, consisting in sending spam e-mail messages with different objectives (advertisements, phishing, etc.); it used to be very profitable but is now a rare activity.

• Credential sniffing and data exfiltration, to steal credentials and sensitive/secret data from the victim machines, gathered by the botmaster for later reuse or for selling them on black markets.

• Denial-of-service attacks, probably the most typical activity of newer botnets, turning them into real cyber-weapons to take down unwanted targets^3,4 or to be rented out for profit^5.

• Bitcoin mining, consisting in using the computational capabilities of the infected machines to mine Bitcoins and obtain the reward.

• Click fraud, consisting in exploiting the infected machines to obtain revenue by means of pay-per-click (PPC) advertisements.

^3 https://krebsonsecurity.com/2016/09/krebsonsecurity-hit-with-record-ddos/
^4 http://www.theregister.co.uk/2016/10/21/dyn_dns_ddos_explained/
^5 https://www.bleepingcomputer.com/news/security/you-can-now-rent-a-mirai-botnet-of-400-000-bots/
2 Botnet Detection: Problems and challenges

The detection of bot infections poses challenges for defenders, since attackers design their malware having in mind the techniques used to detect bot infections; it is, again, the never-ending game between attackers and defenders.

The first challenge for defenders is given by the fact that the architecture used by botnets is constantly evolving, both to increase robustness against disruption attempts and to increase stealthiness against detection, moving from traditional IRC based and centralized solutions, to stealthier HTTP based but still centralized solutions, and even to P2P based solutions, which offer the most advantages from the point of view of both robustness and stealthiness.

Another challenge is posed by the increasing stealthiness of malicious activities, which evolve toward more profitable and less easily observable models. A typical example is the disappearance of e-mail spam activity, easier to detect because of the anomalous DNS MX queries and SMTP traffic it generates, in favor of the rise of stealthier click fraud schemes, which employ less conspicuous HTTP communication and are very profitable.

Finally, since the number of Internet connected devices is monotonically increasing and estimates state that by 2020 about 20 billion Internet of Things (IoT) devices, known to be particularly vulnerable to attacks and malware infections, will be connected to the Internet^6, the design of scalable and performance-efficient solutions will be one of the major challenges for researchers in the field of botnet detection in the next years.

^6 http://www.gartner.com/newsroom/id/3598917

3 Botnet detection approaches taxonomy

To counter the rising botnet phenomenon, a wide set of possible detection strategies has been designed, implemented and evaluated by the research community. Those detection approaches can be categorized along many different perspectives, describing their capabilities and specific methodologies to face the problem of the detection (and possibly characterization) of bot infected hosts at different network scales (enterprise, ISP, etc.). A possible taxonomy of botnet detection methodologies is shown in Figure 1, where botnet detection methods are characterized in terms of:

• Which botnet type they are designed to detect (3.1);
• What aspect of the botnet they are designed to track (3.2);
• What is the source from which the features are extracted (3.3);
• How the features used for the detection are extracted (3.4);
• How the collected features are correlated to derive conclusions (3.5);
• What is the algorithm used to detect the presence of botnets (3.6).
Figure 1: Botnet detection approaches taxonomy.
In the following sections, those characterizations are first briefly introduced and then detailed in terms of the state-of-the-art literature solutions presented in this work, showing how the problem of botnet detection is addressed by currently available solutions.

3.1 Botnet type

As explained in Section 1, botnets may differ on the basis of their structure, communication technique and the malicious activities they perform once they have infected a target and are instructed by the botmaster to do so. This means that in the wild we may encounter a wide variety of different kinds of botnets that we must be able to detect and identify. Also, different characteristics mean both that different challenges have to be overcome and that different kinds of features can be leveraged to reach the detection goal. A possible taxonomy, showing the different types of botnets the currently available detection methods try to detect, is shown in Figure 2.

Figure 2: Botnet detection approaches taxonomy: Botnet type.

In brief, from the botnet type perspective, we have the following characterization:

• Centralized, focused on detecting botnets with a centralized structure, which is the most traditional case; those can be further classified into:
  – IRC-based botnets: the most typical case, where the commands are sent (pushed) through IRC channels and servers to control the zombie hosts;
  – HTTP-based botnets: another classical scheme, in which the commands are retrieved (pulled) by the bots from the C&C using the HTTP protocol;
  – Domain-Flux botnets: which employ Domain-Fluxing techniques in their centralized structure to increase the botnet resilience against disruption attempts; the bots may need to be equipped with Domain Generation Algorithms (DGAs) to be able to get in touch with the C&C server by dynamically generating the current domain behind which the latter is hosted.

• Peer-to-peer (P2P), so the detection process is focused on detecting P2P and distributed botnets, more robust than the centralized ones.

In addition, botnet detection techniques may also decide not to leverage particular structural or protocol characteristics of botnets: in this case, we say that the detection mechanism is not tailored to a particular botnet type, but is rather generic.
3.1.1 Centralized botnets

Most of the initial effort put by researchers into designing effective and useful botnet detection mechanisms focused on detecting centralized botnets. This is due to the fact that the very first examples of botnets were indeed based on a centralized structure, in which bots were controlled by the botmaster through malicious C&C servers, often hard-coded directly in the bot logic. The predominance of this paradigm directly reflected on the literature of those first years, dominated by centralized-based botnet detection methods.

IRC based: This is the most typical case: due to its simplicity and low bandwidth requirements, the IRC protocol has been for a long time the favorite mechanism of botmasters to design botnets. For this reason, many methods have been proposed to detect this kind of botnet [33] [25] [39] [9] [17] [11] [44] [14].

Goebel et al. [9] presented Rishi, an IRC botnet detection method employing n-gram analysis scoring to identify bot-infected hosts, raising an alarm and reporting infected machines to network administrators through automatically generated warning e-mails. Rishi is capable of detecting bot-infected machines by exploiting the fact that those must contact their C&C server directly after infection by means of specific IRC channels that have particular characteristics distinguishing them from non-malicious IRC channels; these are the particular naming conventions observed in bots (e.g. country codes to understand the bot location, large unique numbers to avoid duplicate names, OS substring identifiers, etc.), matches against regular expressions of known botnets, and whitelist/blacklist matching, with the lists maintained both statically (with manual updates by the maintainer) and dynamically by Rishi itself, exploiting the same threshold scoring mechanism used for the detection of bots. This mechanism enables Rishi, which is just a small Python script with less than 2k lines of code, to achieve very good detection results. The problem with Rishi is that it requires signatures and packet inspection, which is not always possible, both because of timing constraints and because of the use of encryption techniques by bots.

Karasaridis et al. [17] designed a wide-scale botnet detection and characterization method capable of characterizing and identifying bot-infected machines at Tier-1 ISP network level by means of non-intrusive, header-based and scalable detection algorithms. Their solution first identifies suspicious hosts involved in Candidate Controller Communications (CCCs), exploiting IRC protocol traffic flow characteristics and communication involving remote hosts appearing to be hub servers (i.e. hosts having multiple connections with many suspected hosts). Then those CCCs are analyzed and aggregated by means of their traffic characteristics, and their maliciousness finally gets validated in terms of additional sources of information (e.g. honeypot based detection, domain name validation, etc.) or by means of the activities performed by the suspect hosts once they are observed (e.g. scanning, spamming, etc.), which are also used to characterize the particular bot infection and the involved hosts by means of a similarity measure exploiting the observed activities.
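As a rough illustration of the kind of scoring Rishi performs on IRC nicknames, the toy sketch below assigns points to a nickname for traits highlighted above (country codes, long numeric identifiers, OS substrings). The patterns, weights, threshold and whitelist are all invented for the example and are not Rishi's actual rules:

```python
import re

# Illustrative scoring rules: each matching pattern adds points to a nickname's
# suspiciousness score (weights and patterns are invented for this sketch).
SCORING_RULES = [
    (re.compile(r"\b(DEU|USA|GBR|ITA)\b", re.I), 2),   # embedded country codes
    (re.compile(r"\d{5,}"), 3),                        # long unique numeric identifiers
    (re.compile(r"(XP|WIN7|W2K|VISTA)", re.I), 2),     # OS identifier substrings
    (re.compile(r"[\[\]|]{2,}"), 1),                   # heavy bracket/pipe decoration
]
WHITELIST = {"friendly_bot"}        # known-benign nicknames (static or learned)
ALARM_THRESHOLD = 5

def score_nickname(nick: str) -> int:
    """Return a suspiciousness score for an observed IRC nickname."""
    if nick in WHITELIST:
        return 0
    return sum(weight for pattern, weight in SCORING_RULES if pattern.search(nick))

def check(nick: str) -> None:
    score = score_nickname(nick)
    if score >= ALARM_THRESHOLD:
        print(f"ALERT: nickname {nick!r} scored {score}, possible bot")

check("[DEU|XP|00|P|48371]")   # typical bot-style nickname -> alert
check("friendly_bot")          # whitelisted -> silent
```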
HTTP based: Centralized botnets may also employ the HTTP protocol, to blend more smoothly into the benign background traffic of the infected hosts. In the literature there are not many works specifically tailored to HTTP based detection [7] [19]; rather, there are many methods whose authors claim to be effective against centralized botnets, independently of whether the communication protocol used is IRC or HTTP [17] [39] [11] [44] [4] [42].

Yen et al. [44] designed TĀMD (Traffic Aggregation for Malware Detection), a system aiming at identifying even stealthy malware communication by exploiting the fact that the infection is rarely constrained to a single host within a big enterprise network, and so aggregating the traffic by similar characteristics to infer the presence of an infection in the network, which can also be a novel, never observed one. The goal is to identify aggregates of traffic involving multiple hosts that share similar characteristics in terms of common destinations, similar payloads (i.e. string edit distance matching with moves) and common host platform, justifying infections even in a platform-specific manner in terms of the defined aggregates. TĀMD leverages no background infection information but only past knowledge about the normal traffic in the network, and employs efficient algorithms from the areas of signal processing and data mining, achieving a very good detection rate even when only a small percentage of hosts was indeed infected (0.0097%).

Domain-Flux botnets: To improve their resilience against takedown attempts, centralized botnets moved to more complex models employing Fast-Fluxing techniques, in which a set of multiple hosts, each associated with a different IP address, register and de-register to a domain name very quickly and repeatedly (employing Round-Robin like schemes). This, combined with Domain Generation Algorithms (DGAs), makes it possible to change the domain name of the C&C servers dynamically without changing the botnet source code and increases the takedown effort dramatically. This, however, comes at the price of the botnet's stealthiness: DGAs produce anomalous DNS traffic, enabling the design of techniques capable of detecting Fast-Flux-based botnets [43] [4].

Yadav et al. [43] designed a botnet detection technique that exploits the features of DGAs to detect Fast-Fluxing botnets. The core ideas are that (i) the domain names created by DGAs have a high level of entropy (required to make them hard to predict), unlike the human generated ones, and (ii) the bots generate many subsequent failed DNS queries close to a successful one. Those characteristics are exploited in the detection phase by implementing an on-line detection mechanism working in bins (i.e. time windows) and performing time and entropy correlation on the DNS queries happening in specific bins. In addition, to improve the efficiency, a staged filtering method employing IP whitelisting, IP degree (i.e. the number of domains resolving to that IP), temporal correlation across bins with a probability-of-DNS-failure threshold (to preserve only hosts with sufficient failure probability) and succeeding/failing domain set entropies (using edit distance for the entropy calculations) ensures that only a small portion of traffic goes through all the steps, consequently reducing the computational effort and enabling the system to process Tier-1 ISP generated traffic with reasonable latency, affording real-time detection with high detection accuracy.
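The two core ideas can be made concrete with a small sketch. It is illustrative only: Yadav et al. use more refined per-group entropy and edit-distance measures, and the thresholds below are invented. For one host's DNS queries in a single time bin, it measures the character-distribution entropy of the queried names and the fraction of failed lookups:

```python
import math
from collections import Counter

def char_entropy(domains):
    """Shannon entropy of the character distribution over the leftmost labels."""
    chars = Counter(c for d in domains for c in d.split(".")[0])
    total = sum(chars.values())
    return -sum((n / total) * math.log2(n / total) for n in chars.values())

def flag_host(queries, entropy_threshold=3.8, failure_threshold=0.5):
    """
    queries: list of (domain, succeeded) pairs observed for one host in a time bin.
    Flags the host when the queried names look algorithmically generated
    (high entropy) and a large fraction of the lookups fail (NXDOMAIN).
    """
    domains = [d for d, _ in queries]
    failures = sum(1 for _, ok in queries if not ok)
    fail_ratio = failures / max(len(queries), 1)
    return char_entropy(domains) > entropy_threshold and fail_ratio > failure_threshold

# Toy example: DGA-looking names with many failed resolutions vs. ordinary browsing.
dga_bin = [("xj3kq9vmz0pl.com", False), ("q8wz2nr7bty4.com", False), ("k0fj3x9qzw2m.com", True)]
web_bin = [("google.com", True), ("wikipedia.org", True), ("github.com", True)]
print(flag_host(dga_bin), flag_host(web_bin))   # -> True False
```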
3.1.2 Peer-to-peer (P2P) botnets

As already stated in Section 1.1, the P2P model started to be favored over the centralized one due to its increased robustness. In addition, along with the rise of P2P file-sharing applications, the capability of the botnet traffic to pass unnoticed increased, making it harder for the defenders to identify it within the possibly huge background traffic. For this reason, researchers moved their focus to detection schemes capable of detecting this new kind of botnet, even in the presence of benign background P2P traffic. There are many botnet detection methods in the literature specifically designed to detect P2P botnets, and they do not lack in variety, with methods based on data mining [22], big data analytics [36], machine learning [34], graph-based techniques [27] [5] [8] and more [31] [23] [46] [45] [3].

Chang et al. [3] designed a P2P botnet detection method employing behavior clustering and statistical tests. The intuition is that by monitoring the network prior to infection it is possible to cluster active nodes into behavior clusters; this is done by selecting attributes that favor the identification of normal behaviors within the network, such as the popularity of the destinations contacted by the nodes, since normal users tend to contact a small range of destinations with different popularities, while bots instead tend to contact a high number of destinations independently of their popularity. The nodes are clustered into behavior clusters by means of a classic agglomerative clustering algorithm that initially considers each single node as a cluster, which is expanded at each round by considering the pairwise distance among the current behavior clusters, measured by the extended Jaccard distance, and merging only if this distance is below a certain threshold, until termination when stability is reached. Then, the identified behavior clusters are used to determine whether a new behavior (i.e. a C&C behavior) has suddenly been introduced in the network: this is done by performing two kinds of statistical tests that consider the l most popular behaviors and whether the new data altered either (i) the proportion of nodes belonging to those clusters or (ii) the intra-cluster distance among those clusters. If a sufficient number of those clusters gets altered, an alarm is raised and the C&C behaviors are identified by using the distance between the updated behaviors in the new observations and the closest behavior in the old ones, clustering nodes between C&C and normal behavior and yielding identification of all bots with a low false positive rate (0.05%).
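A minimal sketch of the clustering step could look like the following. It is not the authors' implementation: the behaviour vectors, threshold and stopping rule are invented, and the statistical tests on new observations are omitted.

```python
import numpy as np

def extended_jaccard_distance(x, y):
    """1 - extended Jaccard similarity between two behaviour vectors."""
    dot = float(np.dot(x, y))
    return 1.0 - dot / (np.dot(x, x) + np.dot(y, y) - dot)

def agglomerative_clusters(vectors, max_distance=0.3):
    """
    Greedy single-linkage agglomeration: start with one cluster per node and
    keep merging the closest pair of clusters while their distance stays below
    max_distance (a stand-in for the paper's stability criterion).
    """
    clusters = [[i] for i in range(len(vectors))]
    def cluster_dist(a, b):
        return min(extended_jaccard_distance(vectors[i], vectors[j]) for i in a for j in b)
    while len(clusters) > 1:
        pairs = [(cluster_dist(a, b), ai, bi)
                 for ai, a in enumerate(clusters) for bi, b in enumerate(clusters) if ai < bi]
        best, ai, bi = min(pairs)
        if best > max_distance:
            break
        clusters[ai] = clusters[ai] + clusters[bi]
        del clusters[bi]
    return clusters

# Toy behaviour vectors: fraction of contacts per destination-popularity bucket.
behaviours = np.array([
    [0.80, 0.15, 0.05],   # ordinary user: mostly popular destinations
    [0.75, 0.20, 0.05],   # ordinary user
    [0.30, 0.30, 0.40],   # bot-like: contacts spread regardless of popularity
])
print(agglomerative_clusters(behaviours))   # -> [[0, 1], [2]]
```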
A problem in P2P botnet detection is that, if the malicious activities performed by bots are hardly observable, detecting those stealthy botnets can be difficult in the presence of benign traffic originating from legitimate P2P applications. This problem has been faced by Zhang et al. [45] [46], who designed a detection method capable of identifying stealthy P2P botnets by means of statistical traffic fingerprints. The detection method in [46] involves 5 steps. First, (i) the traffic is reduced by filtering connections involving DNS-resolved IP addresses, since the vast majority of traffic generated by P2P applications is toward IP destinations that are hard-coded or obtained by looking up IPs from routing tables of the overlay network. Then P2P clients are identified: (ii) coarse-grained detection of P2P clients, exploiting the fact that P2P applications perform many failed outgoing connection attempts to remote peers, to identify candidate P2P clients; (iii) fine-grained detection of P2P clients by means of a flow clustering process that groups the similar flows of each candidate P2P client (in terms of traffic characteristics like the number of sent/received packets and bytes) into fingerprint clusters, enabling the characterization of P2P clients in terms of their set of fingerprint clusters and dropping all candidates with no fingerprint clusters or with destinations covering too few BGP prefixes (which is unlikely for a P2P application). Finally, P2P bots are identified among the P2P clients: (iv) coarse-grained detection of P2P bots, exploiting the fact that, to keep their overlay network active, P2P bots have to stay alive for a time comparable to that of the infected system, and so retaining only the candidate P2P bots whose estimated active time (i.e. the time between the last and the first observed packet) is comparable to the active time of the P2P application, estimated by the statistical fingerprint having the maximum active time among the ones of the node; (v) fine-grained detection of P2P bots, which exploits the facts that bots in the same botnet use the same protocol (i.e. the average number of bytes sent/received is similar) and that the sets of peers contacted by bots of the same botnet largely overlap (differently from legitimate P2P applications), defines two distance functions upon those observations that are then used to group candidate P2P bots into clusters, and preserves only the clusters that are sufficiently dense (because the distance between two hosts is decided by the minimum distance of fingerprint clusters, i.e. the set of contacted IPs largely overlaps), enabling those dense clusters to be flagged as constituted by P2P bots of a specific family. In addition to the already very promising results of [46], in [45] the performance of the detection algorithm is improved by eliminating the time-expensive step (ii) without sacrificing its detection capabilities and by parallelizing the other time-expensive step (v), employing a two-step clustering process applied to each single host that enables work parallelization by considering multiple computational nodes that first partition the hosts into sets to produce multiple two-step clusters, in the end aggregated by a cumulative function. This enables [45] to have 60% reduced storage costs, due to the bypass of (ii), and an overall 68% processing time reduction, making the solution highly scalable and performant.
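The sketch below illustrates only the coarse-grained filtering ideas of steps (i), (ii) and (iv) on a deliberately simplified flow record; the record format and thresholds are invented, and the fine-grained fingerprint clustering of steps (iii) and (v) is not shown:

```python
from collections import defaultdict

def coarse_grained_candidates(flows, observation_window, min_failures=50, min_active_ratio=0.9):
    """
    flows: iterable of dicts with keys src, dst_resolved (bool), failed (bool), timestamp.
    Returns hosts that look like long-lived P2P clients according to two heuristics:
      * many failed outgoing connections toward non-DNS-resolved peers;
      * an active time covering most of the observation window.
    """
    failures = defaultdict(int)
    first_seen, last_seen = {}, {}
    for f in flows:
        if f["dst_resolved"]:
            continue                                  # step (i): keep only non-resolved destinations
        host, t = f["src"], f["timestamp"]
        if f["failed"]:
            failures[host] += 1                       # step (ii): failed peer connection attempts
        first_seen[host] = min(first_seen.get(host, t), t)
        last_seen[host] = max(last_seen.get(host, t), t)

    candidates = {}
    for host, n_failed in failures.items():
        active_ratio = (last_seen[host] - first_seen[host]) / observation_window
        if n_failed >= min_failures and active_ratio >= min_active_ratio:   # step (iv)
            candidates[host] = {"failed": n_failed, "active_ratio": round(active_ratio, 2)}
    return candidates
```

In the real system these survivors would then go through the fingerprint clustering and overlap analysis that actually separates P2P bots from legitimate P2P clients.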
3.1.3 Generic botnets

Along with the methods that exploit the structural characteristics of the botnets, we can find in the literature many detection methodologies that do not take advantage of these properties in their detection algorithms. The advantage of this generalization is the capability to detect a potentially higher number of different botnet types, at the price of having fewer features to exploit in the detection decision. In particular we can find detection methods leveraging signature-based models [41] [21], machine learning [40] [48] [16], data mining [26] [20] and more [12] [13] [24] [18]. Many of the proposals shown in this work detect generic botnets (i.e. all those mentioning no specific botnet type); examples are BotHunter [12] and BotMiner, by Gu et al., presented in the later sections (3.3.2).

3.2 Detection target

Another way in which botnet detection methods can be classified is in terms of what signs of infection they leverage to detect the presence of botnets within the monitored network. Very simply, we have essentially two possibilities: C&C detection or Bot activity detection (Figure 3).

Figure 3: Botnet detection approaches taxonomy: Detection target.

Briefly, the characteristics of the two are:

• C&C detection exploits the fact that all botnets require a command and control channel for sending commands and controlling the bots, which can also be used to report and exfiltrate information from the infected machine;

• Bot activity detection exploits the fact that all botnets, to be a valuable asset for the botmaster, have to perform some kind of malicious activity triggered by the reception of a particular command sent by the botmaster itself.

In the following, the characteristics of the two are deepened and proposals from the literature are presented to give a better understanding of their different capabilities.

3.2.1 C&C detection

C&C detection leverages the fact that every botnet must have a C&C channel to obtain the commands given by the botmaster. There are many different possible channels and communication techniques employed by botnets, as already described in Section 1.2. This implies that there are many different techniques available to track the different kinds of botnet channels employed in practice. The advantage of this approach is that by focusing on the detection of the C&C channel it is possible to achieve early detection and identify infections prior to the execution of malicious operations. Clearly, if the communication channel used is particularly stealthy (e.g. social networks) it could be difficult to detect the botnet by looking only at the C&C communications. A big share of the work available in the literature focuses on C&C detection, performed by employing different techniques: supervised machine learning classifiers [25] [34] [48] [16] [19], clustering [3] [46] [45] [40], data mining [22] [20], graph based approaches (mostly for P2P botnets) [5] [27] [8] and many more [39] [9] [17] [44] [14] [31] [29] [42] [21] [7] [36].

Strayer et al. [39] proposed a system looking for evidence of command and control activity by examining flow characteristics, to identify evidence of botnet activity in the monitored network. It exploits the fact that some botnets exhibit tight command and control interactions between the involved actors, showing timeliness in the reception of and answer to the command-response interactions. This is used to detect botnets employing IRC channels by (i) filtering meaningless traffic through non-chat-like traffic filtering and white/black listing, (ii) flow-based classification using machine learning classifiers to reduce the traffic by preserving only candidate IRC botnet flows, (iii) temporal and size-based flow correlation to group similar flows into clusters of related flows on the basis of a Euclidean-like distance, and (iv) topological analysis to identify topological structures in the automatically generated graphs of the clustered flows (e.g. common endpoints, rendezvous points, etc.) typical of IRC based botnets, suggesting further investigation for the hosts belonging to the suspicious cluster.
3.2.2 Bot activity detection

Since all botnets, to be profitable from the point of view of the botmaster, will sooner or later have to perform some malicious task, this fact can be used by the detection engine to identify bot infections; this is exactly the case of bot activity detection techniques. The advantage of this kind of technique is that it is more likely to identify infections, since the malicious activities typically performed by the bots (except for data exfiltration, which is a very stealthy activity) are easier to observe than the C&C communication, which can be very stealthy. The drawback is clear: we have to wait for the bot to perform some malicious activity, sacrificing the early detection advantages of C&C detection. This is why some works proposed hybrid solutions combining C&C detection with bot activity detection to improve the performance. A good number of solutions proposed in the literature is built on bot activity detection; there are botnet life-cycle based solutions [12] [18], group behavior analysis solutions [11] [13], signature based solutions [33] [24] [41], DNS based solutions [4] [43] [35] and more [26] [23].

Many proposals described in this document focus on detecting the activities performed by the bots rather than their C&C channels. This is the case of the work by Yadav et al. [43], described in Section 3.1.1, which exploits the anomalous DNS activity pattern generated by Domain Generation Algorithms to detect botnets employing Fast-Fluxing techniques; the case of BotHunter by Gu et al. [12], described in later sections (3.3.2), which leverages the dialog warnings generated by an IDS deployed in the monitored network to detect botnets through the observed malicious activities; and, in particular, the case of BotTracer by Liu et al. [24], described in Section 3.3.1, which generates signature based models by dynamically executing the malware in a virtualized environment and detects bot infected hosts by observing the activities performed by the malware in the virtualized environment and matching them in real time with the ones performed by the host in the real world.

3.3 Feature source

Detection methods can also be characterized in terms of the source of information from which the features used by the detection engine are extracted. The different sources of information employed in the field of botnet detection, shown in Figure 4, are divided into two categories: Host-based and Network-based. Their differences are the following:

• Host-based information, obtained by monitoring and analyzing the internals of a computer system for signs of infection, instead of looking at the network traffic on the external interfaces;

• Network-based information, obtained by monitoring and analyzing the network traffic to detect signals of botnet activities. It can be further divided into:
  – Payload inspection, where the entire packet content is used to derive features;
  – Network flows, where the headers of packets are used to reconstruct the network flows involving the monitored hosts and to derive features based on statistics computed on the characteristics of these flows;
  – DNS traffic, where anomalies in the kind, timing and amount of DNS traffic observed in the network are used to derive features;
  – Network connection graphs, where graphs are constructed on the basis of the observed connections involving the monitored hosts, with the goal of identifying structural patterns signaling bot infections;
  – Botnet life-cycle pattern, where the typical life-cycle model observed in botnet infections, from infection to malicious activity exhibition, is used as an anomaly pattern to identify infected machines by looking at the sequence of states or activities observable in the network traffic they go through;
  – Group activities, where the monitored machines in the network are checked to identify hosts performing operations similar in kind and timing, deriving groups of hosts that signify the presence of multiple bot infected hosts in the same network that are likely to belong to the same botnet.

Figure 4: Botnet detection approaches taxonomy: Feature source.
3.3.1 Host information

In this case, information from the underlying machine is leveraged to detect the presence of a bot infection. This is not very common in practice: since botnets, due to their design, are forced to produce network traffic, it is much more typical to leverage network-based information to tune the detection algorithms. However, there are in the literature some examples of detection mechanisms leveraging host-based information, gathered from the internals of the computer system, to detect bot infections [33] [42] [24] [26].

Probably one of the most representative examples is given by BotTracer, proposed by Liu et al. [24]. BotTracer works by exploiting the following observations: (i) bots are designed to automatically start up at system boot, typically by modifying the automatic startup process list or registry entries, differently from other whitelisted applications within the system (i.e. the ones the user gives explicit consent for this to happen); (ii) all bots need to establish a C&C channel with the botmaster, typically a few moments after their automatic start-up; (iii) sooner or later, all bots will have to perform some malicious operation in the system, typically consisting in data exfiltration or participation in coordinated attacks (e.g. DDoS attacks). BotTracer is capable of detecting bot infections by cloning the underlying system into a virtual machine; this virtual machine serves as a real-time testbed: it runs in parallel to the real physical system, but without the noise generated by the user's operations, to better detect the signs of bot infection (i-iii) whenever they are also happening in the real one. However, this kind of host-based system is not meant to substitute network-based ones, but rather to complement them, enforcing better detection capabilities.
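As a very small taste of the host-side signals such systems consume, the sketch below is not BotTracer (which clones the whole system into a virtual machine); it only illustrates observation (i) on Windows, and the whitelist of approved entries is hypothetical:

```python
import winreg  # Windows-only; illustrative of observation (i) above

# Hypothetical whitelist of autostart entries the user explicitly approved.
APPROVED_AUTOSTART = {"OneDrive", "SecurityHealth"}

RUN_KEYS = [
    (winreg.HKEY_LOCAL_MACHINE, r"Software\Microsoft\Windows\CurrentVersion\Run"),
    (winreg.HKEY_CURRENT_USER, r"Software\Microsoft\Windows\CurrentVersion\Run"),
]

def unapproved_autostart_entries():
    """List Run-key entries that are not in the approved whitelist."""
    suspicious = []
    for hive, path in RUN_KEYS:
        try:
            key = winreg.OpenKey(hive, path)
        except OSError:
            continue
        with key:
            i = 0
            while True:
                try:
                    name, command, _ = winreg.EnumValue(key, i)
                except OSError:
                    break                     # no more values under this key
                if name not in APPROVED_AUTOSTART:
                    suspicious.append((name, command))
                i += 1
    return suspicious

if __name__ == "__main__":
    for name, command in unapproved_autostart_entries():
        print(f"unapproved autostart entry: {name} -> {command}")
```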
3.3.2 Network information

The majority of the proposals found in the literature leverage network-based information for the purpose of botnet detection. The reason is clear: all botnets must generate some network traffic, at least to get in touch with their botmaster and receive further instructions. Since the traffic they generate can be characterized along many dimensions, methods using network information to derive their features further decompose into many categories, described in detail in the following.

Payload inspection: The entire content of packets is used to extract features for the detection phase. This ideally makes it possible to obtain more information from the observed traffic, at the price of more processing overhead. The main problem, which led to a drastic drop in the popularity of this method, is the use of encryption and obfuscation techniques by newer botnets, which prevents deriving meaningful knowledge from the content of packets. A significant number of works employing payload-based information have been proposed in the literature, using payload information for signature generation [33] [9] [14] [41] [42] [21], traffic and behavior aggregation [11] [13] [44] and more [12] [18].

Lee et al. [21] proposed an approach to automatically generate payload-based models from botnet and benign traffic traces, which are then used to match packets against this set of automatically generated signatures and detect infections. The approach has two main phases: (i) in the learning phase, packets of the input traces are first grouped according to their size into clusters representing payload length ranges (containing both benign and botnet packets); then token signatures consisting of byte sequences at specific offsets are identified within the packets of the clusters, dividing each range cluster into benign and botnet signatures to form the payload-based models. Then, to improve efficiency and reduce the memory overhead, a model reduction step is carried out to remove from the botnet model irrelevant signature values with low discriminative power or frequently occurring in the benign model (leading to false positives), turning the botnet model into a probabilistic model that considers the probability of a certain token appearing at a specific offset across all the tokens appearing in that position. Finally, (ii) the model is leveraged to match the features extracted from observed packets and to trigger an alarm when a threshold of matched signatures is reached. The approach achieves high detection rates (94%) and a low false positive rate (0.9%), but like all payload-based models it may suffer if encryption is used in the proper way (i.e. salting the original payload to create irrelevant signatures).
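The matching side of such a model can be pictured with the toy sketch below; the length ranges, tokens, offsets and alarm threshold are invented, and the probabilistic weighting of the learned model is reduced to a simple any-signature match:

```python
# Toy payload-signature model in the spirit of the token-at-offset scheme above.
BOTNET_MODEL = {
    # payload length range -> list of (offset, token) signatures
    (0, 64): [(0, b"JOIN #"), (6, b"[DEU|")],
    (65, 1500): [(0, b"GET /cmd"), (13, b"id=")],
}
ALARM_AFTER = 2   # packets that must match before raising an alert

def packet_matches(payload: bytes) -> bool:
    """True if any token signature of the packet's length range matches."""
    for (lo, hi), signatures in BOTNET_MODEL.items():
        if lo <= len(payload) <= hi:
            return any(payload[off:off + len(tok)] == tok for off, tok in signatures)
    return False

def scan(packets) -> bool:
    """Raise an alarm when enough packets match the botnet model."""
    hits = sum(1 for p in packets if packet_matches(p))
    return hits >= ALARM_AFTER

suspicious = [b"GET /cmd.php?id=42" + b" " * 100, b"GET /cmd.php?id=7" + b" " * 100]
print(scan(suspicious))   # -> True
```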
Network flows: Information originating only from the packet headers is extracted and used to reconstruct network flows out of the network traces, for computing statistics and characterizing the observed network traffic. Using only header information means that some information will be dropped: the advantage here is making botnet countermeasures such as packet payload encryption and obfuscation ineffective. Once the network flow statistics are computed, there are many possibilities for deriving the conclusions, such as data mining [26] [22] [20], machine learning [25] [34] [48] [36] [16], behavior clustering and traffic aggregation [39] [3] [23] [46] [45] [40] [13] [11] and many more [17] [31] [5] [27] [8] [12].

Tegeler et al. [40] designed BotFinder, a system that detects bot infections leveraging only high-level properties of the bot's network traffic and without performing deep packet inspection. The system works in two main phases: a training phase and a detection phase. In the training phase, bot samples of different families are run in a testbed environment to extract botnet traces; then, if flow information is not available, flows are reassembled from the captured packet data (to obtain NetFlow data) and traces are extracted by chronologically ordering and grouping all the flows between two endpoints. From the trace so obtained, it is possible to extract the five relevant features used by BotFinder for the botnet model creation: (i) the average time interval between two flows in the trace (since bots typically exhibit regularity); (ii) the average duration of the connections in the trace (since bot connections typically have a small duration); the averages of (iii) source bytes and (iv) destination bytes within the trace (since bots perform many similar connections); (v) the Fast Fourier Transform (FFT) over a binary sampling of the trace, to detect underlying communication regularities in the C&C connection. The models are then created by clustering over the five relevant features, processing the dataset once for each feature and describing the model representing the botnet family by means of these five sets of clusters. In the detection phase, observed traces are matched against the models, comparing each statistical feature of the trace with the model's clusters: there is a "hit" each time a feature of the trace belongs to one of the model's clusters, and the current score is increased in a way proportional to the quality of the cluster and of the trace's feature. Different scores are kept for each model: detection is triggered whenever the highest score is above a global pre-defined acceptance threshold. The method achieves good detection (above 90%) and false positive rates (around 1%) and has been shown capable of detecting novel infections.
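A rough sketch of the per-trace feature extraction is shown below; it is illustrative only: the flow-record layout is invented and the FFT-based periodicity indicator is a crude stand-in for BotFinder's actual feature.

```python
import numpy as np

def trace_features(flows, sampling_interval=60.0):
    """
    flows: list of (start_time, duration, src_bytes, dst_bytes) for one
    src/dst endpoint pair, ordered chronologically (a 'trace').
    Returns the five statistical features sketched above.
    """
    starts = np.array([f[0] for f in flows])
    durations = np.array([f[1] for f in flows])
    src_bytes = np.array([f[2] for f in flows])
    dst_bytes = np.array([f[3] for f in flows])

    # (i) average time between consecutive flows, (ii) average duration,
    # (iii)/(iv) average bytes in each direction.
    avg_interval = float(np.mean(np.diff(starts))) if len(starts) > 1 else 0.0
    avg_duration = float(np.mean(durations))
    avg_src, avg_dst = float(np.mean(src_bytes)), float(np.mean(dst_bytes))

    # (v) binary-sample the trace on a fixed grid and take the dominant
    # non-zero FFT frequency bin as a periodicity indicator.
    grid = np.zeros(int((starts[-1] - starts[0]) / sampling_interval) + 1)
    grid[((starts - starts[0]) / sampling_interval).astype(int)] = 1.0
    spectrum = np.abs(np.fft.rfft(grid))
    dominant = float(np.argmax(spectrum[1:]) + 1) if len(spectrum) > 1 else 0.0

    return avg_interval, avg_duration, avg_src, avg_dst, dominant

# A very regular (C&C-like) trace: one short connection every 10 minutes.
trace = [(t * 600.0, 2.0, 300, 1200) for t in range(12)]
print(trace_features(trace))
```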
DNS-based: In addition, it is possible to use only the information obtained from DNS traffic. This is because, as already stated in Sections 1.2.4 and 3.1.1, botnets may produce an anomalous amount of DNS traffic for different reasons (e.g. DGAs, covert channels, etc.). This fact can be leveraged to derive very valuable information from DNS traffic and to extract features used for detection purposes. For this reason, we can find in the literature some proposals employing only DNS traffic to derive detection conclusions, such as [43], described in Section 3.1.1, which exploits the anomalous DNS traffic generated by DGAs, and other similar methods [4] [35].

Network connection graphs: These methods build the communication graph on the basis of the connections involving the hosts in the observed network, to identify anomalous structures and patterns typical of botnet infections [39] [5] [27] [8]. One of the advantages of these methods is the possibility to design, or to reuse existing (e.g. PageRank^7 [8]), powerful graph and link analysis algorithms to identify bot infections with high confidence. The clear problem is that this kind of analysis gives different results depending on the architecture of the botnet to be detected: very good results are achieved for P2P botnets, while the same is not true for centralized ones. However, the majority of new botnets are built upon P2P overlay networks and many additional methods have been proved effective against centralized botnets (3.1.1); so these methods offer a very good alternative to other anomaly based ones and can also be combined with other techniques, such as cluster analysis, to achieve even better performance [8]. An example of graph based detection mechanisms is the work by Coskun et al. [5], who proposed a method for detecting P2P botnets by observing the fact that their members are very likely to communicate with at least one common external bot during a given time window, i.e. they have a so-called mutual contact. The work, described in 3.5.2, leverages this fact to construct the graph of mutual contacts and to identify even dormant bots (i.e. bots having performed no malicious task yet) by running on this graph a "Dye-Pumping" algorithm designed by the authors, achieving very good results.

^7 https://en.wikipedia.org/wiki/PageRank
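Building the mutual-contacts graph itself is straightforward; the sketch below (the popularity filter and thresholds are invented, and the Dye-Pumping propagation step is not shown) links two internal hosts whenever they share enough unpopular external contacts in a time window:

```python
from collections import defaultdict
from itertools import combinations

def mutual_contact_graph(contacts, min_mutual=2, popular_threshold=50):
    """
    contacts: dict internal_host -> set of external IPs contacted in a time window.
    Returns an undirected weighted edge set linking two internal hosts when they
    share at least min_mutual external contacts, after discarding very popular
    external destinations (CDNs, big web sites) that almost everyone talks to.
    """
    # Drop externals contacted by many internal hosts: they carry no signal.
    popularity = defaultdict(int)
    for ext_set in contacts.values():
        for ext in ext_set:
            popularity[ext] += 1
    filtered = {h: {e for e in s if popularity[e] < popular_threshold}
                for h, s in contacts.items()}

    edges = {}
    for a, b in combinations(filtered, 2):
        shared = len(filtered[a] & filtered[b])
        if shared >= min_mutual:
            edges[(a, b)] = shared      # edge weight = number of mutual contacts
    return edges

contacts = {
    "10.0.0.5": {"203.0.113.7", "198.51.100.3", "192.0.2.44"},
    "10.0.0.9": {"203.0.113.7", "198.51.100.3", "192.0.2.99"},
    "10.0.0.12": {"192.0.2.1"},
}
print(mutual_contact_graph(contacts))   # -> {('10.0.0.5', '10.0.0.9'): 2}
```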
These techniques, being more general, are more flexible than model based ones and typically enable the detection of a higher number of infections. The only problem is that they typically have to go through all (or most) of the above life-cycle steps (i-v), sacrificing early detection advantages. Gu et al. [12] proposed BotHunter, which uses the dialog correlation strategy to detect signs of bot infected hosts in the monitored network by looking for the activities characterizing the infection life cycle. The BotHunter correlator is built on top of the Snort IDS (https://www.snort.org/), using a customized malware-focused rule set kept constantly updated by the Snort community. In addition, to complement the signature based detection capability of Snort, BotHunter extends it with two additional plug-ins: SLADE and SCADE. SLADE (STatistical PayLoad Anomaly Detection Engine) is an anomaly-based engine for payload exploit detection through lossy n-gram byte-distribution analysis, used to identify deviations from a normal traffic profile. SCADE (Statistical sCan Anomaly Detection Engine) is a scan detection plug-in made up of two modules: an inbound scan detection module, which tracks scans toward internal monitored hosts and calculates an anomaly score based on the number of received probes, weighted by the specific port receiving the probe (two possibilities: High Severity, for known highly vulnerable and exploited services, and Low Severity); and an outbound scan detection module based on a voting scheme (AND, OR or MAJORITY) of three independent anomaly detection models considering (i) the outbound scan rate, for detecting hosts performing high-rate scans across a large set of external hosts, (ii) the outbound connection failure rate, for detecting abnormally high failure rates weighted by High/Low severity port usage, and (iii) the normalized entropy of the scan distribution, analyzing the distribution of outbound connection patterns to look for the uniformly distributed target patterns typical of bot scans. Dialogs generated by the signature based and anomaly based engines are classified according to the botnet life-cycle model activities (i-v) in a network dialog correlation matrix managed by the dialog correlation engine, which grows upon dialog generation and shrinks upon dialog expiration, working on the basis of soft and hard time intervals upon which dialogs are pruned and aggregated respectively. When a dialog sequence resulting from the dialog correlation procedure is found to cross the threshold for bot declaration, a bot infection alarm is raised and a bot profile is produced, characterizing the infection on the basis of the received dialogs, its duration and its participants. For a long time, thanks to its capabilities and constant development, BotHunter has been considered the de-facto state of the art in botnet detection and, thanks to its open-source nature (a stand-alone version can be freely downloaded from https://www.metaflows.com/wiki/Stand-Alone_BotHunter), it has been used to compare the performance of newer botnet detection techniques.

Group activities: These methods leverage the presence of anomalous temporal coordination in the kind of activities performed by the hosts of the monitored network. Botnets are characterized by the presence of entire clusters of machines performing the same kind of operation in coordination: this is due to the fact that multiple hosts are infected by the same malware and share the same commands submitted by the botmaster controlling the entire botnet.
This means that, by monitoring the network, in case of multiple
bot infections, we can observe groups of hosts performing similar operations in the same time slots, hence sharing the same traffic behavior. The main advantage of this method is that, by correlating information from multiple hosts, there is stronger evidence of infection, which reflects positively on both the false positive and detection rates. Also, group analysis based techniques are much more difficult for botnet developers to evade, since coordination among the bots characterizes all bot infections: if the botmaster gave each single bot specific commands so as not to show group behavior, he would lose the advantage of having many bots performing the same task, which is the main advantage of botnets, while also making the botnet itself harder to control. He could structure the botnet into different subsets of hosts performing different operations, but this would increase the botnet complexity and would not completely evade group behavior analysis techniques, since he should also ensure that no two bots of the same subset end up in the same network (which is very complex and unattractive for botmasters). However, this method has a clear drawback: it requires multiple hosts in the same network to be infected by the same malware to achieve good performance and a reasonable number of false positives; and, even if botnets typically show lateral spreading characteristics, this is not always the case. For this reason, it is not uncommon to employ additional mechanisms to cope with the possibility of a single (or few) infections. The kinds of group activities that are typically observable and used by botnet detection methods are found in the DNS traffic [4] [35] or in the C&C communications and botnet activities [11] [13]. Gu et al. proposed BotSniffer [11] and the later BotMiner [13], which is probably the most representative work in the category of group behavior analysis detection methods. They work on the principle that botnets, differently from normal activities, demonstrate a synchronized and correlated behavior, both in receiving commands through the C&C channel and in answering on that same channel and/or performing some activity. Their early work, BotSniffer [11], focused on detecting push-based (e.g. IRC-based) and pull-based (e.g. HTTP-based) botnets by devising anomaly based detection algorithms capable of detecting bots belonging to these models in a port-independent way and with no prior knowledge. BotSniffer works by detecting the spatial-temporal correlations and similarities characterizing the message and activity responses of bots belonging to the same botnet. After traffic filtering and a C&C-like protocol matcher to drop unmeaningful traffic, it identifies message responses by monitoring specific C&C protocol messages (e.g. IRC PRIVMSG) and uses two anomaly detection modules, namely Abnormally High Scan Rate and Weighted Failed Connection Rate for scan detection (plus an additional module for spam detection), to identify activity responses. Then, in the correlation stage, it employs two anomaly-detection algorithms for group activity/message response analysis based on a Threshold Random Walk (TRW), enabling strict bounds to be set on the false positive/negative rates.
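In essence, the TRW is a sequential hypothesis test: each observed round nudges a running log-likelihood ratio toward one of two decision thresholds derived from the target false positive/negative bounds. The sketch below shows a generic sequential probability ratio test of this kind (an illustration only, not BotSniffer's exact formulation; the per-round probabilities and error bounds are assumed values):

```python
import math

def trw_decision(rounds, p_bot=0.8, p_benign=0.2, fp_bound=0.005, fn_bound=0.01):
    """Sequential probability ratio test over boolean observations.

    rounds: iterable of booleans, True if the round looked 'bot-like'
    (e.g. a dense or homogeneous response crowd was observed).
    Returns "botnet", "benign" or "undecided".
    """
    upper = math.log((1 - fn_bound) / fp_bound)   # crossing it accepts H1 (botnet)
    lower = math.log(fn_bound / (1 - fp_bound))   # crossing it accepts H0 (benign)
    llr = 0.0
    for bot_like in rounds:
        if bot_like:
            llr += math.log(p_bot / p_benign)
        else:
            llr += math.log((1 - p_bot) / (1 - p_benign))
        if llr >= upper:
            return "botnet"
        if llr <= lower:
            return "benign"
    return "undecided"

print(trw_decision([True, True, True, True]))   # a few bot-like rounds suffice
```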
The Response-Crowd-Density-Check algorithm checks, for each time window, for the presence of a dense response crowd within each group of hosts connecting to the same server: if the fraction of clients within the group showing message/activity response behavior is larger than a threshold, then we say that they form a response crowd. If we observe multiple response crowds within the same group, we have high confidence about whether the group is likely part of a botnet or not and accept that
hypothesis in the statistical test, which is computed by observing many rounds in a TRW until one of the thresholds is reached. The same round based decision is performed for the second algorithm, Response-Crowd-Homogeneity-Check, which, instead of looking at the density (i.e. the number of hosts) of the responses within the group, looks at the homogeneity of the message responses (computed in terms of an n-gram analysis based distance) observed within each time window for the hosts of that group, to identify homogeneous crowds. In addition, there are specific checks that may lead to detection even if only a single host is found infected, by exploiting the broadcast characteristics of IRC channels and applying the same algorithms to the observed incoming messages rather than the outgoing ones for the IRC-based botnets, and to the periodical visiting patterns for the HTTP-based ones. BotMiner [13], instead, makes no assumption on the C&C channel or architecture; the original idea of BotSniffer is improved by employing the machine learning technique of clustering, grouping similar message responses and C&C traffic in a C-plane dedicated to C&C communication traffic clustering and similar activity responses in an A-plane dedicated to malicious activity clustering, and then performing cross-plane correlation, cross-checking clusters in the two planes to find their intersection and reinforce the evidence of bot infection. First, the C-plane and A-plane monitors, deployed at the edge of the network, run in parallel and capture respectively who is talking to whom, by tracking UDP and TCP flows, and who is doing what, analyzing the outbound traffic for signs of malicious activities such as scanning (in the same way as BotSniffer), spamming, binary downloading and exploit activities, implemented as Snort plug-ins. Then, for C-plane clustering, after basic traffic and white-listing filtering, vector representations of C-flows are computed by extracting meaningful features such as the number of flows per hour and the number of packets per flow, and a two-step clustering process first clusters similar flows on a reduced set of dimensions and then refines the results, generating smaller and more precise clusters by running a clustering process on each step-1 cluster considering all the dimensions. Instead, for A-plane clustering, all the hosts performing at least one malicious activity are clustered first according to the type of the activity; then, for each activity type, hosts are further clustered into smaller and more precise clusters according to the specific activity features. Finally, C-plane and A-plane clusters are cross-checked to find their intersection and reinforce the evidence of bot infection. In particular, for all the hosts having performed at least one malicious activity, a botnet score is computed by assigning an activity weight to each cluster and summing up the weights of the clusters the host belongs to; thus a host will have a higher score if it ends up in more A-clusters or if it ends up in a C-cluster with large overlap with A-clusters. For those hosts, a similarity metric is computed by looking at the set of A-plane and C-plane clusters each host belongs to, taken as a bit-mask: this similarity is then used to apply hierarchical clustering and build a dendrogram encoding the relationships between hosts, identifying dense and well separated clusters of (sub-)botnets with very good results (99.6% detection in the worst case).
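The cross-plane scoring step can be pictured with the following minimal sketch (an illustration of the idea described above, not BotMiner's actual implementation; cluster identifiers, weights and the Jaccard similarity used as a stand-in for the bit-mask metric are assumptions): each host's score sums the weights of the clusters it belongs to, and hosts are then compared through their cluster memberships.

```python
from itertools import combinations

def botnet_scores(host_clusters, cluster_weights):
    """host_clusters: host -> set of A-plane/C-plane cluster ids it belongs to.
    cluster_weights: cluster id -> activity weight."""
    return {h: sum(cluster_weights.get(c, 0.0) for c in cs)
            for h, cs in host_clusters.items()}

def membership_similarity(clusters_a, clusters_b):
    """Jaccard similarity of two hosts' cluster-membership sets
    (a simplified stand-in for the bit-mask based metric)."""
    union = clusters_a | clusters_b
    return len(clusters_a & clusters_b) / len(union) if union else 0.0

# Toy example: two hosts sharing an A-cluster and a C-cluster score high and look
# similar to each other; a third host in a benign-looking C-cluster does not.
host_clusters = {"h1": {"A-scan", "C-irc1"}, "h2": {"A-scan", "C-irc1"}, "h3": {"C-web"}}
weights = {"A-scan": 1.0, "C-irc1": 0.5, "C-web": 0.1}
print(botnet_scores(host_clusters, weights))
for a, b in combinations(host_clusters, 2):
    print(a, b, membership_similarity(host_clusters[a], host_clusters[b]))
```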
3.4 Feature extraction
One way in which botnet detection methods differ is the way in which the features used for analysis are extracted. As depicted in Figure 5, there are two main schools of thought: passive approaches and active approaches.
Figure 5: Botnet detection approaches taxonomy: Feature extraction.
Passive approaches are the most typical ones and the ones traditionally used in the near totality of botnet detection mechanisms. They detect the botnet only by observing and analyzing its activities and connections, without actively participating in its operations. Active approaches, instead, are mainly used by researchers for studying and characterizing specific botnet samples and families, understanding their behavioral and distinctive features. They are "active" because, differently from the passive ones, they involve actively participating in the botnet operations by controlling one or more active nodes belonging to (or at least pretending to belong to) the botnet. They can be divided into:
• Fast-flux tracking: Monitor and identify DNS records with low TTL values that may indicate the presence of the Fast-Flux Service Network (FFSN) mechanism typically employed by botnets; this technique has been explored by Passerini et al. [32], who proposed FluXOR, a tool aiming at identifying and characterizing FFSNs by exploiting features based on (i) the TTL of the DNS resource records (FFSN ones are short-lived), (ii) the number of distinct IP addresses the domain resolves to (high for FFSNs, to provide high availability) and (iii) the heterogeneity of the organizations owning these IP addresses (high for FFSNs), which are then combined by means of a Naïve Bayes classifier to identify FFSNs (387 in just two months) that the researchers believe to be associated with at least 16 botnets. Similar results were obtained in the later work by Zhao et al. [47], proving the effectiveness of this method also on P2P based bots using FFSNs for their malware distribution servers, like Storm (https://en.wikipedia.org/wiki/Storm_botnet). A small illustrative sketch of this feature-based classification follows.
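The sketch below is an assumption-laden illustration, not FluXOR itself: it extracts the three features described above from a list of DNS answers and feeds them to a Gaussian Naïve Bayes classifier; the training rows, the toy answers and the organization lookup are hypothetical.

```python
from sklearn.naive_bayes import GaussianNB

def ffsn_features(answers, org_of_ip):
    """answers: (ip, ttl) pairs observed for one domain; org_of_ip: IP -> owning org.
    Returns the three FluXOR-style features: average TTL, #distinct IPs, #organizations."""
    ttls = [ttl for _, ttl in answers]
    ips = {ip for ip, _ in answers}
    orgs = {org_of_ip(ip) for ip in ips}
    return [sum(ttls) / len(ttls), len(ips), len(orgs)]

# Hypothetical labeled training rows: [avg_ttl, distinct_ips, distinct_orgs].
X_train = [[300, 6, 4], [180, 9, 6], [86400, 2, 1], [3600, 3, 1]]
y_train = [1, 1, 0, 0]                      # 1 = fast-flux, 0 = benign
clf = GaussianNB().fit(X_train, y_train)

# Classify a previously unseen candidate domain (toy answers, /24 prefix as "organization").
answers = [("198.51.100.1", 300), ("203.0.113.7", 240), ("192.0.2.15", 260),
           ("198.51.100.77", 220), ("203.0.113.99", 230)]
features = ffsn_features(answers, org_of_ip=lambda ip: ip.rsplit(".", 1)[0])
print("fast-flux" if clf.predict([features])[0] == 1 else "benign")
```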
• C&C server hijack: C&C servers/botmaster peers are seized to discover information on the botnet topology and the list of involved peers. Seizure of the C&C servers/botmaster peers can be either:
– Physical, when defenders/researchers physically hijack the botmaster controlled C&C servers/peers; this is the case of Stone-Gross et al. [37], who obtained control of 16 C&C servers of the Cutwail botnet (https://en.wikipedia.org/wiki/Cutwail_botnet) by getting in touch with the hosting providers these servers belonged to, showing evidence of bot infections and obtaining access to those servers, which enabled them to control the entire botnet for research purposes.
– Virtual, when defenders/researchers manage to redirect C&C communications to a machine they control; this technique has been employed by Stone-Gross et al. [38], who were able to obtain control of the C&C server of the Torpig botnet (https://en.wikipedia.org/wiki/Torpig) by reverse engineering the Domain Generation Algorithm of the botnet and registering the future domains in advance, enabling them to control and study the botnet operation for ten days.
• Infiltration: A defender-controlled machine masquerades as a bot and probes the C&C server/P2P bot peers to progressively gain increasing information on the botnet protocol and on the other peers participating in the botnet. This approach has been used by Xu et al. in CyberProbe [29] and the later AutoProbe [42], which, by infiltrating the botnet operations, were able to identify previously unknown C&C servers at Internet-wide scale. They work by learning traffic fingerprints: through the replay of the messages observed in known malicious traces and the comparison of the responses in the case of CyberProbe, and through sample execution, code branch exploration and trace analysis to extract a set of symbolic equations used to fingerprint the candidate C&C server responses in the case of AutoProbe.
• Honeypot based: Deliberately vulnerable machines are exposed on the Internet to get infected by malware samples, which are then studied by defenders/researchers to prevent the compromise of real machines; this technique has been studied and deeply researched by the Honeynet project (https://www.honeynet.org) and presented in the work by Bächer et al. [33], where honeypots are safely infected by exploiting a firewall (honeywall) that traps the botnet infection, controlling the outgoing traffic and blocking its communication to the outside, thus enabling the study of the sample behavior in safety.
• Sinkholing: Defender/researcher controlled machines are injected into P2P botnet peer lists with the goal of bot enumeration and botnet mitigation, by incrementally pushing all the bots to interact only with the sinkhole peers; this technique is typically used to seize and study P2P botnets such as ZeroAccess (https://en.wikipedia.org/wiki/ZeroAccess_botnet), sinkholed by the security researchers at Symantec [1].
• DNS cache snooping: The cache of DNS servers is analyzed to identify illegitimate or unexpected DNS queries, the presence of known or suspicious DNS records and how frequently a domain is queried, signaling the presence of a botnet; the capabilities and possible use cases of this technique are explored by Grangeia [10].
• Suppression: Incoming/outgoing packets in suspicious network flows are suppressed to elicit known responses from either end of the C&C communication; the goal can be to trigger bot C&C back-up mechanisms by blocking communication to the primary C&C server, as proposed by Neugschwandtner et al. [30], who developed a tool called SQUEEZE, capable of detecting the back-up C&C servers of malware through dynamic execution based on multipath exploration, reverting the malware to its state at the moment a connection is attempted through virtual machine snapshotting, in order to explore both the branch in which the connection is allowed and the one in which it is blocked.
• Injection: Works by identifying C&C communications by injecting packets within suspicious network flows and checking the similarity of the responses to the injected packets with known bot responses for C&C channel identification; this technique is leveraged by BotProbe, proposed by Gu et al. [14], to separate human chat-like IRC communications from bot IRC communications using a set of four hypothesis tests leveraging (i) a challenge based Turing-Test-Hypothesis, (ii) a Single-Binary-Response-Hypothesis test to check whether a client response is observed, (iii) a Correlation-Response-Hypothesis test to check the homogeneity of the responses to the same message and (iv) an Interleaved-Binary-Response-Hypothesis test to check changes in the responses when modified variants of the same message are sent.

3.5 Feature correlation
Botnet detection methods also differ in the way the different features collected for detection purposes are analyzed and combined to derive conclusions regarding the detection of possible infections. In practice, many small bits of information from potentially different kinds of sources are correlated to obtain a higher level, summarized knowledge, later used in the detection phase; the different ways in which features can be correlated are shown in Figure 6.
Figure 6: Botnet detection approaches taxonomy: Feature correlation.
Two correlation schemes are possible:
• Vertical, correlating the sequence (or history) of activities performed by a single host, which is then compared with known models of bot behavior;
• Horizontal, correlating the activities performed by different hosts in the monitored network and detecting bots by observing similarities in the timing and kind of operations, so as to detect the coordinated activities that are typical of botnet infections.

3.5.1 Vertical correlation
Detection is triggered by correlating a chain of suspicious activities performed by a single host, signaling a bot infection on that host. Many of the proposals in the literature employ vertical correlation schemes to derive the proper detection knowledge, and their variety confirms the flexibility of this correlation scheme: there are data mining based solutions [26] [22] [20], machine learning based classifiers [25] [34] [48] [16] [19], NIDS based hybrid solutions [12] [18] and more [17] [31] [7] [35] [3] [40] [43] [24] [14] [41] [29] [42] [21]. Masud et al. [26] proposed an approach applying data mining techniques to two different kinds of log files taken at a specific host: (i) a tcpdump log file, taken with WinDump (https://www.winpcap.org/windump/), containing the packet trace captured at the host, and (ii) an exedump log file, generated by a process tracer implemented by the researchers to track all the applications launched by the host. Traces were collected from uninfected and infected machines (using virtual machines running Windows XP and infected with malware samples) to obtain a dataset used to extract flow-based features through data mining techniques, in turn used to train and test machine learning based classifiers. The idea is to correlate the events between the two logs to detect the cause-effect events upon command submission typical of bot infections, namely observable commands: (i) bot-response, when the command solicits a response from the infected host, (ii) bot-app, when the command causes an application to start, and (iii) bot-other, when the command causes the infected host to contact some other infected host. Statistical features are built on top of these observable commands, by correlating the events between the two logs, and are used to derive classifiers that detect future bot infections by mining real-time logs and extracting the features fed to the classification task.

3.5.2 Horizontal correlation
In this case, the correlation does not focus on a specific host, but rather exploits the fact that bot infections are characterized by the presence of coordinated activities performed by multiple hosts sharing the same infection. This is caused by the fact that commands are issued on a coarse-grained basis, typically involving entire portions of the overall botnet: so multiple infections of the same malware are very likely to exhibit coordination in performing some activities (e.g. querying a specific domain, DoS-attacking a specific host, etc.). This kind of correlation can be more effective in detecting infections, but it fails miserably if there is a single infection (or few infections) in the monitored network. This is why these schemes are typically paired with some additional analysis mechanism, such as a vertical one or a signature based one. In the literature we can identify different solutions built upon horizontal correlation schemes: some of them employ group behavior analysis [11] [13] [4] [35], graph based approaches [5] [8] [27], clustering based classification [3] [23] [46] [45] [31] and more [39] [44] [36].
Coskun et al. [5] proposed a method that aims to detect P2P botnets by observing the fact that their members are very likely to talk to at least one common external bot in a given time window. In other words, depending on the size of the botnet, there is a significant probability for two peers belonging to the same P2P botnet to have at least one mutual contact in a given time frame. Using this information, a mutual contact graph is built, with an edge between each pair of nodes having at least one mutual contact and a weight on that edge representing the cardinality of the mutual contact set of those two nodes. To avoid false positives (e.g. connections to a very popular external server, like google.com), only private mutual contacts, communicating with fewer than a privacy threshold of k hosts, are considered among the possible mutual contacts of two nodes. To compute which hosts are likely to belong to a P2P botnet, an algorithm (the "Dye-Pumping" algorithm) is run on top of the graph, to compute the confidence level of each host being part of the same P2P botnet as the seed node (i.e. a node that has been observed performing malicious tasks, or for which there is evidence of bot infection), from which the algorithm is started. The algorithm starts by assigning the seed the dye it can move to each of its neighbour nodes: for each edge, the assigned dye is the ratio between the mutual contacts of those nodes and the degree of the seed node, raised to a Node Degree Sensitivity Coefficient to balance the dye assigned to high degree nodes. The algorithm goes on for a fixed maximum number of iterations, pumping the available dye to the neighbours in proportion to the weights of the edges. When the algorithm terminates, the nodes having an amount of dye (representing the confidence level of that node being part of the botnet) greater than a threshold are detected as P2P bots, enabling the identification also of dormant bots that did not yet reveal their malicious nature, as illustrated by the sketch below.
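The following minimal sketch conveys the dye-propagation idea only; it is an interpretation for illustration, not the authors' exact Dye-Pumping algorithm, and the weighting, the seed re-injection, the normalization and the parameter values are all assumptions.

```python
def dye_pump(edges, seed, iterations=10, threshold=0.05):
    """edges: node -> {neighbour: mutual_contact_count} (the mutual contact graph).
    Returns the nodes whose share of accumulated dye exceeds the threshold."""
    dye = {node: 0.0 for node in edges}
    dye[seed] = 1.0
    for _ in range(iterations):
        new_dye = {node: 0.0 for node in edges}
        for node, amount in dye.items():
            neighbours = edges[node]
            total_weight = sum(neighbours.values())
            if total_weight == 0:
                new_dye[node] += amount          # isolated node keeps its dye
                continue
            for nb, w in neighbours.items():
                new_dye[nb] += amount * w / total_weight
        new_dye[seed] += 1.0                     # assumption: the seed re-injects dye
        dye = new_dye
    total = sum(dye.values())
    return {node for node, d in dye.items() if d / total >= threshold}

# Toy mutual-contact graph: h1 is the known-bad seed; h2 and h3 share contacts with it.
graph = {"h1": {"h2": 5, "h3": 3}, "h2": {"h1": 5, "h3": 2},
         "h3": {"h1": 3, "h2": 2}, "h4": {}}
print(dye_pump(graph, seed="h1"))                # h4 receives no dye and is excluded
```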
3.6 Detection method
Finally, botnet detection solutions can be characterized in terms of the algorithms analyzing the input data and deriving conclusions, i.e. how the detection and identification of bots is carried out. As Figure 7 shows, the algorithms can be divided into two groups, borrowing mainly from the fields of machine learning and data mining, whose performance has proved to be very high in many areas of data analysis. These are:
• Statistical classification, where observations are assigned the proper class label, inferred by a classification algorithm; the classification procedure can be:
– Threshold based, where an anomaly score for the observations is computed and then compared against a threshold, tuned on the collected observations, to infer the right class label;
– Supervised learning based, where a supervised learning classifier is trained on the collected data to infer the right class label for future observations.
• Cluster analysis, where observations are grouped into clusters of similar objects, according to a specific similarity measure.
Figure 7: Botnet detection approaches taxonomy: Detection algorithm.

3.6.1 Statistical classification
These methods solve the problem of identifying which of a set of classes observations belong to, on the basis of the knowledge learnt from a labeled training dataset, that is essentially a set of observations with a known and verified classification. The goal is to obtain a classification algorithm or function capable of inferring bot observations with high accuracy. There are in general two possibilities, borrowing from the field of machine learning: threshold based classifiers and supervised learning based classifiers.
Threshold based: The principle behind this class of methods is very simple: classify as malicious all the observations that look "too" suspicious. What differentiates the available methods falling in this category is the way in which suspiciousness is computed and the point at which a suspicious observation is labeled as a bot one: namely, they differ in terms of the anomaly score calculation and the threshold tuning. The anomaly score can be computed on the basis of the anomalies in the generated DNS traffic [4] [43] [35], of the timing and content of the responses signaling automated bot activity [14] [29] [42], of the quantity and kind of alarms generated by network monitoring software [12] [18] and more [24] [41] [21] [5] [7]. As for the threshold, in most cases it is decided experimentally [9] [5] [4] [43], or at least the weights for the score are decided in this way [9] [12]; alternatively, it can be computed algorithmically by employing techniques such as the Threshold Random Walk (TRW) [14]. Examples of threshold based techniques are Yadav et al. [43], employing an experimentally computed threshold for anomalies in the DNS traffic, Rishi by Goebel et al. [9], computing a weighted anomaly score to identify anomalous IRC traffic (both described in section 3.1.1), BotHunter by Gu et al. [12], computing a weighted anomaly score for the events generated by the Snort IDS and generating profiles of bot infections for all the chains of events exceeding the threshold (described in 3.3.2), and more techniques described in the previous sections [5] [24] [21].
Supervised learning based: According to machine learning terminology, we label as supervised learning classifiers all those classifiers whose knowledge is extracted from an already labeled training set, considered as the ground truth upon which the classifier is built. In this case, the composition of the dataset is critical: to avoid the
resulting classifier underfitting (i.e. not capturing the underlying trend of the data) or overfitting (i.e. describing the random error or the noise instead of the underlying relationship), quantity and variety are essential properties of the training dataset. Also, a small part of the dataset has to be reserved for testing the classifier: the classifier must be able to cope with unseen behavior, which is possible only for classifiers that are neither overfitted nor underfitted; so testing is a critical step in the evaluation of the classifier's performance. When finely tuned, supervised learning classification enables both very high detection rates and very low false positive and false negative rates, which explains the growing interest behind these detection methods. Since there are many different possibilities, it is very typical to train multiple different classifiers and then choose the one showing the most promising results. There are works evaluating Naïve Bayes and J48 decision trees [39] [25] [22], Random Forests [36], Boosted Decision Trees and Support Vector Machines (SVM) [20], Artificial Neural Networks [19], and more [48] [16] [26] [34]. Kirubavathi et al. [20] proposed a botnet detection method based on the mining of traffic flow characteristics, which are then used to train machine learning classifiers achieving high precision and recall rates. The key factor behind the performance of the proposed method is the quality of the features mined from the network flows: by extracting only 4 features, it is capable of precisely modeling the most relevant characteristics of botnet traffic in order to identify bot infected hosts. Those features are: (i) Small packets (Ps), capturing the fact that automated bot behavior is characterized by the exchange of many small packets (in the range of 40-320 bytes) due to activities of scanning, information harvesting and maintenance of the C&C network (especially in the case of P2P botnets); (ii) Packet ratio (Pr), capturing the fact that pre-programmed bots are capable of performing a limited number of (pre-programmed) activities, thus limiting the observed behavior and so the possible fluctuations in the traffic they generate (differently from normal traffic, which fluctuates a lot), captured by the ratio between the number of incoming and outgoing packets in a given time window; (iii) Initial packet length (Pl), which captures the fact that the initial exchange of packets follows a well defined behavior depending on the (botnet) communication protocol, containing information that can be successfully used to identify the flow as malicious or benign; (iv) Bot Response Packet (BRp), which captures the fact that, for automated bots, when an incoming packet originating from some specific IP-port combination is received, a response toward the same IP-port destination is observed with a typically constant response time. Out of these four features, a flow vector is constructed for each time window from the values they take in that window (N.B. the initial packet length is calculated only once, when the flow is established, and then carried over to future time windows); the window duration is a critical parameter representing the trade-off between accuracy and timeliness of detection and must therefore be carefully tuned through extensive evaluation.
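A minimal sketch of this per-window feature extraction is given below; it is illustrative only, not the authors' implementation: the packet record layout, the small-packet range and the simplification of BRp to a count of quickly answered incoming packets are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    ts: float        # timestamp in seconds
    size: int        # payload size in bytes
    incoming: bool   # True if received by the monitored host

def flow_vector(packets, initial_pkt_len, resp_time_limit=1.0):
    """Compute [Ps, Pr, Pl, BRp] for one flow over one time window."""
    ps = sum(1 for p in packets if 40 <= p.size <= 320)          # small packets
    outgoing = sum(1 for p in packets if not p.incoming)
    incoming = len(packets) - outgoing
    pr = incoming / outgoing if outgoing else 0.0                # packet ratio
    # BRp (simplified): incoming packets answered within resp_time_limit seconds
    brp = 0
    for i, p in enumerate(packets):
        if p.incoming and any(not q.incoming and 0 <= q.ts - p.ts <= resp_time_limit
                              for q in packets[i + 1:]):
            brp += 1
    return [ps, pr, initial_pkt_len, brp]

window = [Packet(0.0, 120, True), Packet(0.2, 90, False), Packet(5.0, 1400, True)]
print(flow_vector(window, initial_pkt_len=120))
```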
The first achievement of this work is showing that it is not the number of features that makes the difference, but rather their quality; in addition, a small number of features enables faster and more efficient detection. Three machine learning classifiers are trained and 10-fold cross-validated with datasets containing the flow vectors of benign (FTP services, Web services, BitTorrent applications and more) and malicious (Zeus, SpyEye, BlackEnergy botnets and many more) traces created by the
researchers in a testbed environment or obtained by leveraging public datasets (e.g. the ISOT botnet dataset, http://www.uvic.ca/engineering/ece/isot/datasets/), assembling three vast and varied datasets to train and test the classifiers. The machine learning classifiers used were (i) Boosted Decision Tree, using the AdaBoost ensemble algorithm to construct a strong classifier out of a weighted linear combination of weak classifiers, (ii) Naïve Bayes classifier, a probabilistic classifier based on Bayes' theorem analyzing the relationship between the features and the corresponding class to derive the likelihood (a conditional probability) linking the features to the classes, and (iii) Support Vector Machine, a widely used classification technique transforming the dataset to a higher dimension where the data instances of the (typically two) classes can be divided into different spaces through a separation hyperplane maximizing the margin between the closest data points of the different spaces. High precision and recall rates are achieved by all three classifiers on all three datasets, with very high time efficiency, and with the Naïve Bayes classifier showing the best performance in both detection and efficiency.
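To make the training setup concrete, here is a minimal sketch of 10-fold cross-validation over flow vectors, with three scikit-learn classifiers standing in for those evaluated in the paper; the synthetic feature matrix and the default hyperparameters are assumptions for illustration, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic flow vectors [Ps, Pr, Pl, BRp]: label 1 = bot-like, 0 = benign.
X_bot = rng.normal([30, 1.0, 80, 25], [5, 0.05, 10, 5], size=(200, 4))
X_benign = rng.normal([8, 2.5, 400, 3], [4, 1.0, 150, 2], size=(200, 4))
X = np.vstack([X_bot, X_benign])
y = np.array([1] * 200 + [0] * 200)

classifiers = {
    "AdaBoost": AdaBoostClassifier(),
    "NaiveBayes": GaussianNB(),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```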
3.6.2 Cluster analysis
These methods, differently from statistical classification, which tries to identify which class a particular observation belongs to, solve the problem of grouping observations into clusters in such a way that the similarity with the other observations in the same cluster is greater than with those belonging to other clusters. In other words, the typical goal of cluster analysis is to identify the clusters so that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized, identifying dense and well defined groups of objects. In machine learning terminology, cluster analysis is considered an instance of unsupervised learning, since it works on unlabeled data with no classification included in the observations; thus, in this case no evaluation of the accuracy of the structure output by the algorithm is performed. Parameters of the clustering procedure are the similarity (or distance) function used and a density threshold or the specific number of clusters, which can also be computed iteratively by incrementally adjusting the number of desired clusters and picking the value that gives the highest quality clusters. The similarity metric used in the clustering process is probably the feature that most characterizes the different methods proposed in the literature: there are cluster analysis methods clustering observations simply on the basis of the flow 5-dimensional tuple (source IP, source port, destination IP, destination port, protocol) and then performing further analysis on the identified flows [17] [36], on the basis of computed traffic flow characteristics [39] [3] [40], on the basis of C&C communication and/or performed malicious activities [13] [23], on graph structural characteristics [8] and more [44] [46] [45]. Many of the methods presented in the previous sections are based on cluster analysis; examples are TĀMD, proposed by Yen et al. [44], presented in 3.1.1 and clustering on the basis of statistical features extracted from the flows, BotMiner, proposed by Gu et al. [13], presented in 3.3.2 and clustering observations on two independent planes on the basis of C&C communication (C-plane) and performed malicious activities (A-plane), later performing cross-cluster correlation through hierarchical clustering, and the work by Zhang et al. [46] [45], presented in 3.1.2, applying multiple steps of coarse-grained and fine-grained clustering first to identify P2P hosts and then
P2P bots, by finding fingerprint clusters characterizing bot behavior and measuring their similarity to identify bot families.

4 Research directions and conclusions
The current detection methods proposed in the literature show very high performance in terms of detection rates and false positive/negative rates, meaning that the current techniques, mostly leveraging data mining and machine learning algorithms combined with the more traditional anomaly and signature based approaches, offer a well-tested and solid ground for future botnet detection approaches. However, Internet of Things (IoT), Machine-to-Machine (M2M) and similar applications will introduce new challenging tasks from the scalability point of view: the more devices there are, the more traffic is generated. So, the design of scalable and distributed solutions will be the future challenge that researchers in the field of botnet detection will have to overcome. Some researchers have already focused on scalability, mostly in relation to P2P botnet detection [46] [45] (described in section 3.1.2), by employing an efficient combination of coarse-grained and fine-grained clustering steps, reducing the high computational costs typical of cluster analysis based solutions and enabling the detection of stealthy P2P botnets, even when blended within benign P2P traffic. Even more benefits can be obtained by adopting distributed computation, at least in the steps requiring the most computational effort; for example, [45] improves the performance of the prototype work [46] by performing a two-step clustering for the costly task of fine-grained P2P host detection, dividing the set of hosts (whose flows can be analyzed individually) across many computational nodes to identify step-1 clusters, whose results are then aggregated using a cumulative function. One of the most representative works showing the power of distributed computing in the field of botnet detection is the one proposed by Singh et al. [36], employing the Hadoop implementation of the MapReduce paradigm both for feature extraction, by submitting traffic traces to the Hadoop Distributed File System (HDFS) and letting Apache Hive (https://hive.apache.org) extract the flow based features using the flow 5-dimensional tuple (source IP, source port, destination IP, destination port, protocol) as key for traffic flow aggregation, and for classification, by employing Apache Mahout (http://mahout.apache.org), the machine learning library built on top of Hadoop, to train a Random Forest classifier, achieving high true positive rates (above 99%) and low false positive rates (below 0.3%). Concluding, the new challenges introduced by the newest paradigms will require the design of scalable, distributed and performance efficient solutions, leveraging the integration with well-tested and well-known parallel and distributed programming frameworks.
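The key-by-5-tuple aggregation step can be pictured with a minimal MapReduce-style sketch in plain Python; this only illustrates the idea and is not the Hadoop/Hive/Mahout pipeline used by Singh et al., and the packet record fields and the aggregated features are assumptions.

```python
from collections import defaultdict

def map_packet(pkt):
    """Map phase: emit (flow 5-tuple, per-packet partial features)."""
    key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
    return key, (1, pkt["size"])                     # (packet count, bytes)

def reduce_flow(values):
    """Reduce phase: aggregate partial values into flow-level features."""
    packets = sum(v[0] for v in values)
    total_bytes = sum(v[1] for v in values)
    return {"packets": packets, "bytes": total_bytes,
            "avg_pkt_size": total_bytes / packets}

def aggregate(trace):
    grouped = defaultdict(list)
    for pkt in trace:
        key, value = map_packet(pkt)
        grouped[key].append(value)                   # shuffle/group by key
    return {key: reduce_flow(vals) for key, vals in grouped.items()}

trace = [
    {"src_ip": "10.0.0.5", "src_port": 4444, "dst_ip": "203.0.113.9",
     "dst_port": 80, "proto": "TCP", "size": 120},
    {"src_ip": "10.0.0.5", "src_port": 4444, "dst_ip": "203.0.113.9",
     "dst_port": 80, "proto": "TCP", "size": 90},
]
print(aggregate(trace))
```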
References
[1] R. Gibb and A. Neville. ZeroAccess Indepth. Tech. rep. Symantec Security Response, 2013. url: http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/zeroaccess_indepth.pdf.
[2] M. Casenove and A. Miraglia. "Botnet over Tor: The illusion of hiding." In: 2014 6th International Conference On Cyber Conflict (CyCon 2014). Vrije Universiteit: NATO CCD COE, 2014, pp. 273–282. url: https://ccdcoe.org/cycon/2014/proceedings/d3r2s3_casenove.pdf.
[3] Su Chang and Thomas E. Daniels. "P2P Botnet Detection Using Behavior Clustering & Statistical Tests". In: Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence. AISec '09. Chicago, Illinois, USA: ACM, 2009, pp. 23–30. isbn: 978-1-60558-781-3. doi: 10.1145/1654988.1654996. url: http://doi.acm.org/10.1145/1654988.1654996.
[4] H. Choi et al. "Botnet Detection by Monitoring Group Activities in DNS Traffic". In: 7th IEEE International Conference on Computer and Information Technology (CIT 2007). Oct. 2007, pp. 715–720. doi: 10.1109/CIT.2007.90.
[5] Baris Coskun, Sven Dietrich, and Nasir Memon. "Friends of an Enemy: Identifying Local Members of Peer-to-peer Botnets Using Mutual Contacts". In: Proceedings of the 26th Annual Computer Security Applications Conference. ACSAC '10. Austin, Texas, USA: ACM, 2010, pp. 131–140. isbn: 978-1-4503-0133-6. doi: 10.1145/1920261.1920283. url: http://doi.acm.org/10.1145/1920261.1920283.
[6] Christian J. Dietrich et al. "On Botnets That Use DNS for Command and Control". In: Proceedings of the 2011 Seventh European Conference on Computer Network Defense. EC2ND '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 9–16. isbn: 978-0-7695-4762-6. doi: 10.1109/EC2ND.2011.16. url: http://dx.doi.org/10.1109/EC2ND.2011.16.
[7] M. Eslahi, H. Hashim, and N. M. Tahir. "An efficient false alarm reduction approach in HTTP-based botnet detection". In: 2013 IEEE Symposium on Computers & Informatics (ISCI). Apr. 2013, pp. 201–205. doi: 10.1109/ISCI.2013.6612403.
[8] Jérôme François et al. "BotTrack: Tracking Botnets Using NetFlow and PageRank". In: Proceedings of the 10th International IFIP TC 6 Conference on Networking - Volume Part I. NETWORKING'11. Valencia, Spain: Springer-Verlag, 2011, pp. 1–14. isbn: 978-3-642-20756-3. url: http://dl.acm.org/citation.cfm?id=2008780.2008782.
[9] Jan Goebel and Thorsten Holz. "Rishi: Identify Bot Contaminated Hosts by IRC Nickname Evaluation". In: Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets. HotBots'07. Cambridge, MA: USENIX Association, 2007, pp. 8–8. url: http://dl.acm.org/citation.cfm?id=1323128.1323136.
[10] L. Grangeia. "DNS Cache Snooping or Snooping the Cache for Fun and Profit". In: UNC Computer Science (2014).
[11] Guofei Gu, Junjie Zhang, and Wenke Lee. "BotSniffer: Detecting Botnet Command and Control Channels in Network Traffic." In: NDSS. The Internet Society, June 18, 2009. url: http://dblp.uni-trier.de/db/conf/ndss/ndss2008.html#GuZL08.
[12] Guofei Gu et al. "BotHunter: Detecting Malware Infection Through IDS-driven Dialog Correlation". In: Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium. SS'07. Boston, MA: USENIX Association, 2007, 12:1–12:16. isbn: 111-333-5555-77-9. url: http://dl.acm.org/citation.cfm?id=1362903.1362915.
[13] Guofei Gu et al. "BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-independent Botnet Detection". In: Proceedings of the 17th Conference on Security Symposium. SS'08. San Jose, CA: USENIX Association, 2008, pp. 139–154. url: http://dl.acm.org/citation.cfm?id=1496711.1496721.
[14] G. Gu et al. "Active Botnet Probing to Identify Obscure Command and Control Channels". In: 2009 Annual Computer Security Applications Conference. Dec. 2009, pp. 241–253. doi: 10.1109/ACSAC.2009.30.
[15] C. Guarnieri. Skynet, a Tor-powered botnet straight from Reddit. Ed. by Rapid7. 2012. url: https://community.rapid7.com/community/infosec/blog/2012/12/06/skynet-a-tor-powered-botnet-straight-from-reddit.
[16] F. Haddadi and A. N. Zincir-Heywood. "Benchmarking the Effect of Flow Exporters and Protocol Filters on Botnet Traffic Classification". In: IEEE Systems Journal 10.4 (Dec. 2016), pp. 1390–1401. issn: 1932-8184. doi: 10.1109/JSYST.2014.2364743.
[17] Anestis Karasaridis, Brian Rexroad, and David Hoeflin. "Wide-scale Botnet Detection and Characterization". In: Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets. HotBots'07. Cambridge, MA: USENIX Association, 2007, pp. 7–7. url: http://dl.acm.org/citation.cfm?id=1323128.1323135.
[18] Sheharbano Khattak et al. "BotFlex: A community-driven tool for botnet detection". In: Journal of Network and Computer Applications 58 (2015), pp. 144–154. issn: 1084-8045. doi: http://dx.doi.org/10.1016/j.jnca.2015.10.002. url: http://www.sciencedirect.com/science/article/pii/S1084804515002155.
[19] G. Kirubavathi Venkatesh and R. Anitha Nadarajan. "HTTP Botnet Detection Using Adaptive Learning Rate Multilayer Feed-forward Neural Network". In: Proceedings of the 6th IFIP WG 11.2 International Conference on Information Security Theory and Practice: Security, Privacy and Trust in Computing Systems and Ambient Intelligent Ecosystems. WISTP'12. Egham, UK: Springer-Verlag, 2012, pp. 38–48. isbn: 978-3-642-30954-0. doi: 10.1007/978-3-642-30955-7_5. url: http://dx.doi.org/10.1007/978-3-642-30955-7_5.
[20] G. Kirubavathi and R. Anitha. "Botnet detection via mining of traffic flow characteristics". In: Computers & Electrical Engineering 50 (2016), pp. 91–101. issn: 0045-7906. doi: http://dx.doi.org/10.1016/j.compeleceng.2016.01.012. url: http://www.sciencedirect.com/science/article/pii/S0045790616000148.
[21] C. N. Lee, F. Chou, and C. M. Chen. "Automatically Generating Payload-Based Models for Botnet Detection". In: 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). Dec. 2015, pp. 1038–1044. doi: 10.1109/SmartCity.2015.206.
[22] W. H. Liao and C. C. Chang. "Peer to Peer Botnet Detection Using Data Mining Scheme". In: 2010 International Conference on Internet Technology and Applications. Aug. 2010, pp. 1–4. doi: 10.1109/ITAPP.2010.5566407.
[23] Dan Liu et al. "A P2P-Botnet detection model and algorithms based on network streams analysis". In: 2010 International Conference on Future Information Technology and Management Engineering. Vol. 1. Oct. 2010, pp. 55–58. doi: 10.1109/FITME.2010.5655788.
[24] Lei Liu et al. "BotTracer: Execution-Based Bot-Like Malware Detection". In: Proceedings of the 11th International Conference on Information Security. ISC '08. Taipei, Taiwan: Springer-Verlag, 2008, pp. 97–113. isbn: 978-3-540-85884-3. doi: 10.1007/978-3-540-85886-7_7. url: http://dx.doi.org/10.1007/978-3-540-85886-7_7.
[25] C. Livadas et al. "Using Machine Learning Techniques to Identify Botnet Traffic". In: Proceedings. 2006 31st IEEE Conference on Local Computer Networks. Nov. 2006, pp. 967–974. doi: 10.1109/LCN.2006.322210.
[26] M. M. Masud et al. "Flow-based identification of botnet traffic by mining multiple log files". In: 2008 First International Conference on Distributed Framework and Applications. Oct. 2008, pp. 200–206. doi: 10.1109/ICDFMA.2008.4784437.
[27] Shishir Nagaraja et al. "BotGrep: Finding P2P Bots with Structured Graph Analysis". In: Proceedings of the 19th USENIX Conference on Security. USENIX Security'10. Washington, DC: USENIX Association, 2010, pp. 7–7. isbn: 888-7-6666-5555-4. url: http://dl.acm.org/citation.cfm?id=1929820.1929830.
[28] Shishir Nagaraja et al. "Stegobot: A Covert Social Network Botnet". In: Proceedings of the 13th International Conference on Information Hiding. IH'11. Prague, Czech Republic: Springer-Verlag, 2011, pp. 299–313. isbn: 978-3-642-24177-2. url: http://dl.acm.org/citation.cfm?id=2042445.2042473.
[29] Antonio Nappa et al. "CyberProbe: Towards Internet-Scale Active Detection of Malicious Servers." In: NDSS. The Internet Society, 2014. url: http://dblp.uni-trier.de/db/conf/ndss/ndss2014.html#NappaXRCG14.
[30] Matthias Neugschwandtner, Paolo Milani Comparetti, and Christian Platzer. "Detecting Malware's Failover C&C Strategies with Squeeze". In: Proceedings of the 27th Annual Computer Security Applications Conference. ACSAC '11. Orlando, Florida, USA: ACM, 2011, pp. 21–30. isbn: 978-1-4503-0672-0. doi: 10.1145/2076732.2076736. url: http://doi.acm.org/10.1145/2076732.2076736.
[31] S. K. Noh et al. "Detecting P2P Botnets Using a Multi-phased Flow Model". In: 2009 Third International Conference on Digital Society. Feb. 2009, pp. 247–253. doi: 10.1109/ICDS.2009.37.
[32] Emanuele Passerini et al. "FluXOR: Detecting and Monitoring Fast-Flux Service Networks". In: Detection of Intrusions and Malware, and Vulnerability Assessment: 5th International Conference, DIMVA 2008, Paris, France, July 10-11, 2008. Proceedings. Ed. by Diego Zamboni. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 186–206. isbn: 978-3-540-70542-0. doi: 10.1007/978-3-540-70542-0_10. url: http://dx.doi.org/10.1007/978-3-540-70542-0_10.
[33] Honeynet Project. Know your Enemy: Tracking Botnets. Published on the Web. url: http://www.honeynet.org/papers/bots/.
[34] S. Saad et al. "Detecting P2P botnets through network behavior analysis and machine learning". In: 2011 Ninth Annual International Conference on Privacy, Security and Trust. July 2011, pp. 174–180. doi: 10.1109/PST.2011.5971980.
[35] Reza Sharifnya and Mahdi Abadi. "DFBotKiller: Domain-flux botnet detection based on the history of group activities and failures in DNS traffic". In: Digital Investigation 12 (2015), pp. 15–26. issn: 1742-2876. doi: http://dx.doi.org/10.1016/j.diin.2014.11.001. url: http://www.sciencedirect.com/science/article/pii/S1742287614001182.
[36] Kamaldeep Singh et al. "Big Data Analytics framework for Peer-to-Peer Botnet detection using Random Forests". In: Information Sciences 278 (2014), pp. 488–497. issn: 0020-0255. doi: http://dx.doi.org/10.1016/j.ins.2014.03.066. url: http://www.sciencedirect.com/science/article/pii/S0020025514003570.
[37] Brett Stone-Gross et al. "The Underground Economy of Spam: A Botmaster's Perspective of Coordinating Large-scale Spam Campaigns". In: Proceedings of the 4th USENIX Conference on Large-scale Exploits and Emergent Threats. LEET'11. Boston, MA: USENIX Association, 2011, pp. 4–4. url: http://dl.acm.org/citation.cfm?id=1972441.1972447.
[38] Brett Stone-Gross et al. "Your Botnet is My Botnet: Analysis of a Botnet Takeover". In: Proceedings of the 16th ACM Conference on Computer and Communications Security. CCS '09. Chicago, Illinois, USA: ACM, 2009, pp. 635–647. isbn: 978-1-60558-894-0. doi: 10.1145/1653662.1653738. url: http://doi.acm.org/10.1145/1653662.1653738.
[39] W. T. Strayer et al. "Detecting Botnets with Tight Command and Control". In: Proceedings. 2006 31st IEEE Conference on Local Computer Networks. Nov. 2006, pp. 195–202. doi: 10.1109/LCN.2006.322100.
[40] Florian Tegeler et al. "BotFinder: Finding Bots in Network Traffic Without Deep Packet Inspection". In: Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies. CoNEXT '12. Nice, France: ACM, 2012, pp. 349–360. isbn: 978-1-4503-1775-7. doi: 10.1145/2413176.2413217. url: http://doi.acm.org/10.1145/2413176.2413217.