Feature Selection Strategies for HTTP Botnet Traffic Detection
1. Feature Selection Strategies for HTTP Botnet Detection
● Authors
○ Ivan Letteri
○ Giuseppe Della Penna
○ Pasquale Caianiello
University of L’Aquila (Italy)
4. Roadmap
● Goal
○ Classify the traffic
● Develop
○ Feature extraction and detection
● Challenge
○ HTTP botnet detection
- Classify the traffic generated by botnets using Machine Learning models
- Develop a system that extracts features for detecting malicious traffic
- Identify the traffic generated by camouflaged bots within normal HTTP traffic
6. Related Work
● Feature Importance
○ MIFS-ND
○ mRMR
○ Max info index
● Feature Selection
○ Hoque et al.
○ Peng et al.
○ Mitra et al.
- Hoque et al.: packet size-based features, average bytes and variance of bytes per packet
- Peng et al.: length in bytes, number of packets, flow duration, TCP flags, length of flow
- Mitra et al.: entropy of packet sizes
“a good data source builds … a good classifier”
7. roBOTNETwork in a nutshell
● DDoS
● Mining cryptocurrency
● Steal sensitive data
● Send spam
.....
- roBOT NETwork: a large network of compromised devices connected to the Internet
- controlled by a single entity called the Botmaster
- used for benevolent or malicious purposes
8. Botnet life-cycle
● Steps
1. Initial infection
2. Secondary Injection
- Initial Infection: the process during which the victim’s machine is compromised
- Secondary Injection: the victim downloads, executes and installs a copy of the bot binary code
9. Botnet life-cycle
● Steps
1. Initial infection
2. Secondary Injection
3. Connection
4. Attack Command
5. Update & Maintenance
- Connection: the bot contacts its C&C server to announce its presence (rallying mechanism)
- Attack Command: the botmaster sends commands that trigger attacks (DDoS, spam, phishing, etc.)
- Update & Maintenance: the last step, in which the botmaster keeps the bots active and updated
10. Dataset Construction
● Raw Dataset
○ realistic traffic
○ MCFProject
- Stratosphere Project: a behavioral-based intrusion detection system that uses ML
- Packet Capture (*.pcap): the file format produced by the API for capturing network traffic
- Pandas: a data manipulation library highly optimized for performance
11. Dataset Construction
● Raw Dataset
○ realistic traffic
○ MCFProject
● Develop
○ Feature extraction and detection
- Flow: <Source IP, Source Port, Destination IP, Destination Port, Protocol>
- Time Windows: set to 15 minutes, since web sessions typically last about that long
- Filter: all data that is not required is removed from the flow sets (e.g., UDP packets)
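The three notes above (5-tuple flow key, 15-minute windows, filtering) can be sketched as follows. This is only an illustration under stated assumptions: the packet dictionaries and their key names are hypothetical, and assigning a packet to a window by flooring its timestamp is one possible reading of the slide, not necessarily the authors' procedure.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # 15-minute time window, as stated on the slide


def group_into_flows(packets):
    """Group packets by (time window, 5-tuple) and drop non-TCP packets.

    Each packet is a dict with hypothetical keys:
    ts, src_ip, src_port, dst_ip, dst_port, proto.
    """
    flows = defaultdict(list)
    for pkt in packets:
        if pkt["proto"] != "TCP":  # filter step: e.g. UDP packets are removed
            continue
        window = int(pkt["ts"] // WINDOW_SECONDS)
        key = (window, pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)


def make_packet(ts, proto="TCP", dst_port=80):
    # Helper for the example only; addresses come from documentation ranges.
    return {"ts": ts, "src_ip": "192.0.2.10", "src_port": 40000,
            "dst_ip": "198.51.100.7", "dst_port": dst_port, "proto": proto}


packets = [
    make_packet(10.0),
    make_packet(20.0),
    make_packet(950.0),                       # falls into the next window
    make_packet(30.0, proto="UDP", dst_port=53),  # filtered out
]
flows = group_into_flows(packets)
```

Here the two early TCP packets share one flow, the packet at 950 s opens a second flow in the next window, and the UDP packet is discarded.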
12. Final HTTP-botnet dataset
● Raw Dataset
○ realistic traffic
○ MCFProject
● Develop
○ Feature extraction and detection
● Balanced Dataset
○ 50% HTTP botnet traffic
○ 50% normal traffic
13. Eight Features Selected & Extracted
● 2 entropy features
○ Packet count
○ Time gap
- Entropy of packet count: aggregates the flows that have the same destination address
- Entropy of time gap: derived from the interval between the end of one flow and the beginning of the next
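A minimal sketch of how two such entropy features might be computed. The slides do not give the exact formulas, so this assumes plain Shannon entropy over the observed values; the function names and the `(start, end)` flow representation are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def time_gaps(flows):
    """Gap between the end of each flow and the start of the next one.

    `flows` is a list of (start, end) timestamps, an assumed representation.
    """
    ordered = sorted(flows, key=lambda f: f[0])
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


# Packet counts of flows that share one destination address.
regular_counts = [3, 3, 3, 3]   # identical counts: no uncertainty
varied_counts = [1, 2, 3, 4]    # all different: maximal uncertainty

# Flows that restart at a fixed interval produce identical gaps.
periodic_flows = [(0, 5), (10, 12), (17, 20), (25, 30)]
gaps = time_gaps(periodic_flows)
```

Very regular traffic, such as a bot polling its C&C server at a fixed interval, yields low entropy under this definition, while human browsing tends to produce more varied values.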
14. Features based on TCP Packet Ratios
● 2 entropy features
○ Packet count
○ Time gap
● 3 TCP flow features
○ In/Out tcp pkts
○ Ratio TCP
○ OneWay TCP pkt
- I/O ratio: helps to identify the communication between a bot and its C&C
- Ratio TCP: helps to discover DDoS botnet attacks
- OneWay ratio TCP: helps to identify a larger-than-usual number of failed or half-open, one-way connections
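The slides name the three ratio features but not their formulas. The sketch below shows one plausible set of definitions; every formula, parameter name and dictionary key here is an assumption made for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def tcp_ratio_features(n_in, n_out, n_tcp, n_total, n_one_way, n_flows):
    """Compute three ratio features for one time window (assumed definitions).

    n_in / n_out : incoming and outgoing TCP packets
    n_tcp        : TCP packets among n_total packets of any protocol
    n_one_way    : flows with packets in only one direction, out of n_flows
    """
    return {
        "io_ratio": safe_ratio(n_in, n_out),
        "tcp_ratio": safe_ratio(n_tcp, n_total),
        "one_way_ratio": safe_ratio(n_one_way, n_flows),
    }


features = tcp_ratio_features(
    n_in=50, n_out=100, n_tcp=150, n_total=200, n_one_way=5, n_flows=20
)
```

Guarding the division keeps the features defined for windows with no outgoing packets or no flows at all.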
15. Features based on the TCP flags
● 2 entropy features
○ Packet count
○ Time gap
● 3 TCP flow features
○ In/Out tcp pkts
○ Ratio TCP
○ OneWay TCP pkt
● 3 TCP flags features
○ SYN flag active
○ FIN flag active
○ PSH flag active
- SYN flag set: a SYN flood attack sends a huge number of SYN requests
- FIN flag set: in a FIN flood attack, bots send a large number of spoofed FIN packets
- PSH flag set: the receiver is forced to flush its buffer even if it is not full
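As a rough sketch, the three flag features could be obtained by counting packets whose TCP flags byte has the corresponding bit set. The bit positions are those of the standard TCP header; the input format (one flags byte per packet) and the dictionary keys are assumptions.

```python
# Bit masks of the TCP header flags byte.
FIN = 0x01
SYN = 0x02
PSH = 0x08


def count_flags(flag_bytes):
    """Count how many packets have the SYN, FIN and PSH flags active."""
    return {
        "syn": sum(1 for f in flag_bytes if f & SYN),
        "fin": sum(1 for f in flag_bytes if f & FIN),
        "psh": sum(1 for f in flag_bytes if f & PSH),
    }


# Flags bytes of five packets: SYN, SYN+ACK, PSH+ACK, FIN+ACK, ACK.
window_flags = [0x02, 0x12, 0x18, 0x11, 0x10]
flag_counts = count_flags(window_flags)
```

An unusually high SYN or FIN count within one time window would then stand out as a candidate flood pattern.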
16. Exploratory Data Analysis
● Scatter Matrix
○ Covariance
○ Dimensionality reduction
● Values distribution
○ Boxplot
● Correlation matrix
○ data uncertainty
- Scatter Matrix: provides an estimate of the covariance matrix and is used in dimensionality reduction
- Boxplot: summarizes the distribution of the data efficiently
- Correlation Matrix: useful for Mutual Information analysis, to measure dependence between features
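To make the correlation-matrix step concrete, here is a small dependency-free sketch using the Pearson coefficient. The feature columns and their names are made up for the example; in practice a library such as Pandas would compute the same matrix directly.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict with the pairwise correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical feature columns, one value per flow.
columns = {
    "feat_a": [1, 2, 3, 4],
    "feat_b": [2, 4, 6, 8],   # grows with feat_a
    "feat_c": [4, 3, 2, 1],   # shrinks as feat_a grows
}
corr = correlation_matrix(columns)
```

Pairs with a coefficient close to +1 or -1 carry largely redundant information, which is one reason to inspect this matrix before selecting features.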
17. Feature Selection via 4 Decision Trees & XGBoost
● Decision Trees
○ Extra Trees
○ Gradient Boost
○ Ada Boost
○ Random Forest
● XGBoost
○ Gradient Boosting Trees
- Decision Trees: feature importance as implemented in the scikit-learn library
- XGBoost: an algorithm that predicts a target variable by combining the estimates of a set of weaker models
18. Feature Selection through Decision Trees
● Decision Trees
○ Extra Trees
○ Gradient Boost
○ Ada Boost
○ Random Forest
● XGBoost
○ Gradient Boosting Trees
● Evaluation
○ Feature Importance average
- Filter out the features that we consider less relevant for HTTP botnet detection
- IOratioTcp, nTcpFinal and Hcount are removed
- nTcpPsh does not seem very important, although only slightly less than Hcount
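The averaging of feature importances across the four tree ensembles might look like the sketch below. It runs on synthetic data, not on the authors' dataset, leaves XGBoost out because it lives in a separate library, and uses hyperparameters chosen only for the example.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Synthetic stand-in data: only the first two columns influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

models = [
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    AdaBoostClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# One row of importances per model, then the per-feature average.
importances = np.array([m.fit(X, y).feature_importances_ for m in models])
mean_importance = importances.mean(axis=0)

# Feature indices ordered from least to most important.
ranking = np.argsort(mean_importance).tolist()
```

The features at the front of `ranking` are the candidates for removal; averaging over several ensembles reduces the dependence on any single model's bias.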
21. Feature Selection through Mutual Information
● Partition Information
○ partition entropy
● Conditional Information
○ conditional entropy
● Mutual Information
- Information H(i): let i be a feature and ∆i be the partition induced by i
- Conditional Information: let i, o be features; the conditional entropy is the amount of uncertainty that remains in ∆i once ∆o is known
- Mutual Information: the uncertainty in the partition ∆i that is removed by knowing ∆o, and vice versa
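The three quantities above can be illustrated with the standard Shannon definitions, treating each distinct feature value as one block of the induced partition. This is a sketch under that assumption; the slides may use a different normalization, and the function names are chosen for the example.

```python
import math
from collections import Counter


def entropy(values):
    """H of the partition induced by `values` (in bits)."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(i, o):
    """H(i | o) = H(i, o) - H(o): uncertainty left in i once o is known."""
    return entropy(list(zip(i, o))) - entropy(o)


def mutual_information(i, o):
    """I(i; o) = H(i) - H(i | o): uncertainty in i removed by knowing o."""
    return entropy(i) - conditional_entropy(i, o)


feature = [0, 0, 1, 1]
same_partition = [0, 0, 1, 1]   # induces exactly the same partition
independent = [0, 1, 0, 1]      # tells nothing about `feature`

mi_same = mutual_information(feature, same_partition)
mi_indep = mutual_information(feature, independent)
```

A feature that shares a lot of information with the class label is relevant, while one that shares a lot with another selected feature is largely redundant.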
22. Feature Selection through Mutual Information
Hgap
IOratioTcp
ratio_Tcp
● Partition Information
○ partition entropy
● Conditional Information
○ conditional entropy
● Mutual Information
- Features are removed one by one, lowest score first
- after each removal we run our MLP classifier and record its performance metrics
23. Experimentation
● MLP
○ 8 neurons IN layer
○ 3 hidden layers
○ 1 neuron OUT layer
● Activation function
○ ReLU
○ Sigmoid
● Loss function
○ binary cross entropy
- Hidden layers of 24 (3f), 16 (2f) and 8 (1f) neurons, where f = 8 is the number of input features; learning rate set to 0.001
- binary cross entropy as the loss function, Adam as the optimizer
- 150 training epochs on 70% of the dataset
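A network with this shape can be approximated with scikit-learn's `MLPClassifier`, used here as a stand-in since the slides do not say which framework the authors used. The data is synthetic, and the mapping of the slide's settings onto scikit-learn parameters is an assumption: for binary targets the model applies a logistic (sigmoid) output and minimizes log loss, which is binary cross entropy.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 8-feature dataset (not the authors' data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

# 70% of the samples for training, the rest held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = MLPClassifier(
    hidden_layer_sizes=(24, 16, 8),  # 3f, 2f, 1f with f = 8
    activation="relu",               # ReLU in the hidden layers
    solver="adam",
    learning_rate_init=0.001,
    max_iter=150,                    # upper bound on training epochs for Adam
    random_state=0,
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Training may stop before 150 epochs if the loss stops improving, and scikit-learn may emit a convergence warning if it does not; neither affects the architecture shown here.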
24. Information Relevance Score
● Classification Accuracy & Loss
○ MLP performance metrics
○ Progressive removal of the lowest-ranked features
○ Feature Importance vs. Mutual Information
- the mutual information-based technique achieves almost the same accuracy as the feature importance-based one
- after removing the three lowest-ranked features, its accuracy is only 0.03% lower than that obtained with the feature importance ranking, but the loss is further reduced (by 0.72%)
27. Conclusions
● Focus
○ HTTP botnet detection
● Experimentation
○ Feature Importance
○ Mutual Information
● Results
○ Feature relevance scores
- HTTP botnets keep growing year after year
- the mutual information strategy wins over the decision-tree feature importance
- the results are shown through the accuracy and loss metrics of the MLP model
28. Thank You
● for watching
● for listening
● ... your attention
- github.com/IvanLetteri/ - https://ivanletteri.it
- www.linkedin.com/in/ivan-letteri-6516b427/