Storing and Querying Large Network Traffic Datasets with Apache Hive
Mitchell Baller[1], Susan Urban[2]
Department of Computer Science, Case Western Reserve University[1]
Department of Industrial Engineering, Texas Tech University[2]
Texas Tech University 2015 NSF Research Experience for Undergraduates Site Program
Abstract
Today's increasing threats to cybersecurity can be
difficult to detect with traditional anti-virus software and
intrusion detection systems. Instead of screening
individual programs and transmitted data, our goal is to
detect malware and other forms of intrusion through
network attack behavior patterns. In particular, this
research investigates the use of Apache Hive to store
and query large network file dumps. These dumps are
large collections of packet captures (pcap files) that are
exported to comma-separated value (CSV) files and loaded
into Hive. HiveQL is then used to query the network traffic
and capture valuable statistics that can reveal malicious
trends in the network traffic. Queries in HiveQL have
been designed to demonstrate existing malware traffic
analysis techniques, with performance comparisons to
the filtering capabilities of tools such as Wireshark or
tshark. The objective of this research is to demonstrate
the efficiency of using big data technology such as
Apache Hive for analyzing large datasets for suspicious
activity.
Methods
•Using IBM InfoSphere BigInsights version 3.
•Running on Windows 7 Enterprise.
•Processor: Intel Core 2 Duo.
•RAM: 8.00 GB, with 5.9 GB devoted to the virtual
machine.
•Export pcap files to CSV (comma-separated value)
format.
•Load data into table.
•Run queries that emulate common analysis strategies.
•Compare speed of Wireshark and Hive.
•Test use of partitioning to more efficiently query data.
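The effect of the partitioning step can be illustrated outside Hive. The sketch below is a plain-Python analogy, not the project's Hive configuration: it groups rows by an assumed partition column (`protocol` is a hypothetical choice, since the poster does not state which column was used) so that a query restricted to one partition value scans only that group. Hive achieves a similar effect by storing each partition in its own HDFS directory. The sample rows and function names are invented for illustration.

```python
from collections import defaultdict

def partition_rows(rows, key):
    """Group rows by the value of `key`, like one directory per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def count_matching(partitions, key_value, predicate):
    """Count rows satisfying `predicate`, scanning only one partition.

    Returns (match_count, rows_scanned) so the pruning effect is visible.
    """
    candidates = partitions.get(key_value, [])
    matches = sum(1 for row in candidates if predicate(row))
    return matches, len(candidates)

# Tiny made-up sample; the column names are assumptions for illustration.
rows = [
    {"protocol": "HTTP", "info": "GET /index.html HTTP/1.1"},
    {"protocol": "HTTP", "info": "HTTP/1.1 200 OK"},
    {"protocol": "TCP", "info": "49152 > 443 [SYN] Seq=0"},
    {"protocol": "NBNS", "info": "Name query NBSTAT *<00>"},
    {"protocol": "TCP", "info": "443 > 49152 [SYN, ACK] Seq=0 Ack=1"},
]

partitions = partition_rows(rows, "protocol")
matches, scanned = count_matching(
    partitions, "HTTP", lambda r: "GET" in r["info"]
)
print(f"matches={matches}, rows scanned={scanned} of {len(rows)}")
```

Only the rows in the requested partition are examined, which is why a well-chosen partition column can reduce query time when most queries filter on that column.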
Future Work
•Further explore partitioning and how it increases speed
and efficiency.
•Statistical analysis on data attributes to feed into
machine learning algorithms to automatically detect
threats.
•Investigate other analysis tools.
•Compare machine learning analysis to commercial
antivirus and intrusion detection software.
References:
❑"Traffic Analysis Exercise." Malware-Traffic-Analysis. 30
June 2015. Web.
❑Sindhuri, L. B. (2015). Applying Big Data Analytics on
Integrated Cybersecurity Datasets (Unpublished master's
thesis). Texas Tech University, Lubbock, TX.
❑"Capture Files from Mid-Atlantic CCDC." Netresec. 2013.
Web.
Main Objective
•Use Apache Hive to store large sets of pcap files.
•Partition data to allow for fast and efficient queries.
•Test Hive’s capacity to query relevant packets and
statistics.
•Compare with other tools commonly used to assess
network traffic data such as Wireshark and tshark.
•Demonstrate the effectiveness of partitioning.
Apache Hive
•A component of the Hadoop framework for
processing large data sets.
•Distributed storage through HDFS (Hadoop
Distributed File System).
•Distributed processing through MapReduce.
•Converts SQL-like queries into MapReduce jobs.
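To make the last point concrete, the sketch below mimics in plain Python how a filtered count (for example, counting rows whose `info` field contains `GET`) could decompose into a map phase and a reduce phase. It is a conceptual illustration only: the function names and the two sample splits are hypothetical, and the plan Hive actually generates is more elaborate.

```python
from functools import reduce

def map_phase(split):
    """Mapper with local aggregation: filter one input split, emit a partial count."""
    return sum(1 for row in split if "GET" in row["info"])

def reduce_phase(partial_counts):
    """Reducer: combine the partial counts from all mappers into one total."""
    return reduce(lambda a, b: a + b, partial_counts, 0)

# Two made-up input splits standing in for HDFS blocks of the packet table.
splits = [
    [{"info": "GET /a.js HTTP/1.1"}, {"info": "HTTP/1.1 200 OK"}],
    [{"info": "GET /b.png HTTP/1.1"}, {"info": "GET / HTTP/1.1"}],
]

# On a cluster the mappers would run in parallel, one per split.
partial_counts = [map_phase(s) for s in splits]
total = reduce_phase(partial_counts)
print(partial_counts, total)
```

Because each split is processed independently, adding nodes lets more splits be scanned at once, while the reduce step stays small.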
DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or
the Department of Defense.
Figure 1: Screenshot of Apache Hive
Figure 2: Screenshot of Wireshark
Common Analysis Strategies
•Filter by http.request, then look through the hosts to find
suspicious traffic.
•Find traffic from all TCP ports except 80 with the SYN flag
set.
•Search for any external NetBIOS traffic.
Results
•The overhead of the virtual machine, together with the
map and reduce phases, made queries on small datasets
slower in Hive than in Wireshark.
•However, Wireshark could not run queries on datasets
much larger than 5 million packets.
•Even at 5 million packets, Hive was faster than
Wireshark.
•Hive can continue adding packets while maintaining an
approximately linear increase in query time (Figure 4).
•Partitioning improves query performance further
(Figure 3).
•Furthermore, Hive can easily add cores from
commodity hardware to improve performance further.
Figure 4: Performance of Apache Hive compared with Wireshark as the number of packets processed increases.
Figure 3: Screenshot demonstrating the performance increase associated with partitioning.
Total packets  Hive load time  Query 1 avg  Query 2 avg  Query 3 avg  Hive query average  Wireshark packets  Wireshark average
0              4.6             54.4666667   45.8666667   33.4333333   44.58888889         0                  0
3679916        19.7            70.3333333   69.2666667   67.05        68.88333333         3679916            65.25
5753004        29.5            93           86.2333333   86.1666667   88.46666667         2073088            39.5
7068211        38.8            92.4666667   90.5         93.1333333   92.03333333         1315207            29
8006273        41.2            107.266667   107.233333   107.233333   107.2444444         938062             19.5
9493847        47.8            115.733333   113.9        117.1        115.5777778         1487574            29.75
9849142        51.8            131.866667   133.5        115.3        126.8888889         355295             8.75
11306882       54.8            144.6        121.033333   150.9        138.8444444         1457740            35.25
13188523       80.3            141.533333   159.3        140.366667   147.0666667         1881641            54.75
16547769       79.2            197.4        182.266667   198.833333   192.8333333         3359246            77
18815952       100.2           218.433333   243.133333   219.8        227.1222222         2268183            51.5
21424373       105.3           265.033333   238.866667   275.6        259.8333333         2608421            71.25
21646375       104.6           245.766667   274.066667   246.866667   255.5666667         222002             3.25
26297364       118.3           335.2        294.8        288.366667   306.1222222         4650989            93
31728102       154.5           334.666667   328.733333   391.65       351.6833333         5430738            106.5
35286631       169.4           318.133333   292.166667   296.233333   302.1777778         3558529            70
Figure 5: Data used to plot Figure 4.
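One way to quantify the trend in this data is an ordinary least-squares line through the total packet counts and the Hive query averages. The sketch below is an illustrative calculation, not part of the original analysis; it uses only those two columns, copied from the table, and the time unit is whatever unit the table was recorded in, which the poster does not state.

```python
# (total packets, Hive query average) pairs copied from the Figure 5 data.
DATA = [
    (0, 44.58888889), (3679916, 68.88333333), (5753004, 88.46666667),
    (7068211, 92.03333333), (8006273, 107.2444444), (9493847, 115.5777778),
    (9849142, 126.8888889), (11306882, 138.8444444), (13188523, 147.0666667),
    (16547769, 192.8333333), (18815952, 227.1222222), (21424373, 259.8333333),
    (21646375, 255.5666667), (26297364, 306.1222222), (31728102, 351.6833333),
    (35286631, 302.1777778),
]

def least_squares(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares(DATA)
slope_per_million = slope * 1_000_000  # extra query time per million packets
print(f"slope: {slope_per_million:.2f} time units per million packets")
print(f"intercept (fixed overhead estimate): {intercept:.2f} time units")
```

The intercept gives a rough estimate of the fixed overhead that makes Hive slower on small datasets, and the slope estimates the marginal cost of each additional million packets.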
Queries Used
•Query 1: SELECT COUNT(*) FROM final WHERE info LIKE '%GET%';
•Query 2: SELECT COUNT(*) FROM final WHERE srcPort <> 80 AND flag LIKE '%0x0002%';
•Query 3: SELECT COUNT(*) FROM final WHERE info LIKE '%NBSTAT%';
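For readers without a Hive installation, the three predicates can be approximated in plain Python over rows parsed from a CSV export. This is a sketch under assumptions: the column names `srcPort`, `flag`, and `info` follow the queries listed here, the `dstPort` column and all sample rows are invented, and substring tests stand in for the `LIKE` patterns.

```python
import csv
import io

# Invented sample in an assumed CSV layout (the header names are assumptions).
SAMPLE_CSV = """srcPort,dstPort,flag,info
49152,80,0x0018,GET /index.html HTTP/1.1
49153,443,0x0002,49153 > 443 [SYN] Seq=0
80,49152,0x0012,"80 > 49152 [SYN, ACK] Seq=0 Ack=1"
137,137,,Name query NBSTAT *<00>
49154,8080,0x0002,49154 > 8080 [SYN] Seq=0
80,50000,0x0002,80 > 50000 [SYN] Seq=0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def query1(rows):
    """Count rows whose info field contains GET (HTTP requests)."""
    return sum(1 for r in rows if "GET" in r["info"])

def query2(rows):
    """Count rows with a source port other than 80 and the SYN-only flag value."""
    return sum(
        1 for r in rows
        if int(r["srcPort"]) != 80 and "0x0002" in r["flag"]
    )

def query3(rows):
    """Count rows whose info field contains NBSTAT (NetBIOS name queries)."""
    return sum(1 for r in rows if "NBSTAT" in r["info"])

rows = load_rows(SAMPLE_CSV)
print(query1(rows), query2(rows), query3(rows))
```

A single-machine scan like this reads every row for every query, which is the cost that Hive spreads across mappers on larger datasets.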
Figure 6: Illustration of Apache Hadoop