Storing and Querying Large Network Traffic Datasets with Apache Hive
Mitchell Baller[1], Susan Urban[2]
Department of Computer Science, Case Western Reserve University[1]
Department of Industrial Engineering, Texas Tech University[2]
Texas Tech University 2015 NSF Research Experience for Undergraduates Site Program
Abstract
Today's increasing threats to cybersecurity can be
difficult to detect with traditional anti-virus software and
intrusion detection systems. Instead of screening
individual programs and transmitted data, our goal is to
detect malware and other forms of intrusion through
network attack behavior patterns. In particular, this
research investigates the use of Apache Hive to store
and query large network file dumps. Network file dumps
are large collections of packet captures (pcap) that are
exported to comma-separated value (CSV) files and loaded
into Hive. HiveQL is then used to query the network traffic
and compute statistics that can reveal malicious
trends in that traffic. Queries in HiveQL have
been designed to demonstrate existing malware traffic
analysis techniques, with performance comparisons to
the filtering capabilities of tools such as Wireshark or
tshark. The objective of this research is to demonstrate
the efficiency of using big data technology such as
Apache Hive for analyzing large datasets for suspicious
activity.
Methods
•Using IBM InfoSphere BigInsights version 3.
•Running on Windows 7 Enterprise.
•Processor: Intel Core 2 Duo.
•RAM: 8.00 GB, with 5.9 GB devoted to the virtual
machine.
•Export pcap file to csv (comma separated value)
format.
•Load data into table.
•Run queries that emulate common analysis strategies.
•Compare speed of Wireshark and Hive.
•Test use of partitioning to more efficiently query data.
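The export, load, and query steps above can be tried on a laptop before moving to a cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for Hive; it is not the setup used in this work. The table name final and the columns info, srcPort, and flag come from the queries listed under Queries Used; the remaining columns (packet_id, src, dst) and all sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of a capture. Only info, srcPort and flag are named
# in the poster's queries; the other columns and every value are made up.
SAMPLE_CSV = """packet_id,src,dst,srcPort,flag,info
1,192.0.2.10,198.51.100.7,49152,0x0018,GET /index.html HTTP/1.1
2,192.0.2.10,198.51.100.7,49153,0x0002,49153 > 443 [SYN]
3,198.51.100.7,192.0.2.10,80,0x0012,80 > 49152 [SYN ACK]
4,192.0.2.11,192.0.2.255,137,,Name query NBSTAT
5,192.0.2.10,198.51.100.7,49154,0x0018,GET /logo.png HTTP/1.1
"""

# The three queries from "Queries Used", with plain ASCII quotes.
# Note: SQLite's LIKE ignores ASCII case while Hive's LIKE is case-sensitive,
# so results can differ on real data; the sample rows avoid that ambiguity.
QUERIES = {
    "Query 1": "SELECT COUNT(*) FROM final WHERE info LIKE '%GET%'",
    "Query 2": "SELECT COUNT(*) FROM final "
               "WHERE srcPort <> 80 AND flag LIKE '%0x0002%'",
    "Query 3": "SELECT COUNT(*) FROM final WHERE info LIKE '%NBSTAT%'",
}


def load_packets(csv_text):
    """Parse the CSV text and load it into an in-memory table named final."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE final (packet_id INTEGER, src TEXT, dst TEXT, "
        "srcPort INTEGER, flag TEXT, info TEXT)"
    )
    rows = [
        (int(r["packet_id"]), r["src"], r["dst"],
         int(r["srcPort"]), r["flag"], r["info"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO final VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def run_queries(conn):
    """Run each query and return a mapping of query name to packet count."""
    return {name: conn.execute(sql).fetchone()[0]
            for name, sql in QUERIES.items()}


if __name__ == "__main__":
    for name, count in run_queries(load_packets(SAMPLE_CSV)).items():
        print(name, count)
```

A single-process database like this says nothing about Hive's performance; it is only useful for checking that the CSV columns and query predicates behave as intended on a small sample.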
Future Work
•Further explore partitioning and how it increases speed
and efficiency.
•Statistical analysis on data attributes to feed into
machine learning algorithms to automatically detect
threats.
•Investigate other analysis tools.
•Compare machine learning analysis to commercial
antivirus and intrusion detection software.
References:
❑"Traffic Analysis Exercise." Malware-Traffic-Analysis. 30
June 2015. Web.
❑Sindhuri, L. B. (2015). Applying Big Data Analytics on
Integrated Cybersecurity Datasets. (Unpublished master's
thesis). Texas Tech University, Lubbock, TX.
❑"Capture Files from Mid-Atlantic CCDC." Netresec. 2013.
Web.
Main Objective
•Use Apache Hive to store large sets of pcap files.
•Partition data to allow for fast and efficient queries.
•Test Hive’s capacity to query relevant packets and
statistics.
•Compare with other tools commonly used to assess
network traffic data such as Wireshark and tshark.
•Demonstrate the effectiveness of partitioning.
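The idea behind partitioning can be illustrated without a cluster. Hive stores each partition of a table in its own sub-directory named column=value, so a query that filters on the partition column only has to read the matching directory. The sketch below imitates that layout with plain CSV files. The poster does not state which partition key was used; the protocol column here is a hypothetical choice, as are the helper names and sample rows.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, key):
    """Append each row to <root>/<key>=<value>/data.csv, one directory per value."""
    root = Path(root)
    for row in rows:
        part_dir = root / f"{key}={row[key]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "data.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=sorted(row))
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def count_in_partition(root, key, value, predicate):
    """Count matching rows while reading only the partition <key>=<value>.

    Returns (matched, scanned) so the amount of data touched is visible.
    """
    path = Path(root) / f"{key}={value}" / "data.csv"
    if not path.exists():
        return 0, 0
    matched = scanned = 0
    with path.open(newline="") as fh:
        for row in csv.DictReader(fh):
            scanned += 1
            if predicate(row):
                matched += 1
    return matched, scanned


if __name__ == "__main__":
    # Hypothetical packets; "protocol" is an assumed partition column.
    packets = [
        {"protocol": "HTTP", "info": "GET /index.html"},
        {"protocol": "HTTP", "info": "POST /login"},
        {"protocol": "NBNS", "info": "Name query NBSTAT"},
        {"protocol": "TCP", "info": "[SYN]"},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(packets, root, "protocol")
        print(count_in_partition(root, "protocol", "HTTP",
                                 lambda r: "GET" in r["info"]))
```

The benefit depends on queries actually filtering on the partition column; a query with no such filter still has to scan every directory.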
Apache Hive
•A component of the Hadoop framework for
processing large data sets.
•Distributed storage through HDFS (Hadoop
Distributed File System).
•Distributed processing through MapReduce.
•Converts SQL-like queries into MapReduce code.
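To make the last point concrete, a filtered count such as SELECT COUNT(*) ... WHERE info LIKE '%GET%' can be thought of as a map phase that emits a 1 for every matching row in each input split, followed by a reduce phase that sums those values. The sketch below is a conceptual, single-process illustration in Python; it is not the plan Hive actually generates, and the function names and sample rows are hypothetical.

```python
from functools import reduce


def split_rows(rows, n_splits):
    """Divide rows into roughly equal input splits (stand-ins for HDFS blocks)."""
    size = max(1, -(-len(rows) // n_splits))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_phase(split):
    """Mapper: emit ("count", 1) for every row whose info field contains GET."""
    return [("count", 1) for row in split if "GET" in row["info"]]


def reduce_phase(pairs):
    """Reducer: sum the values emitted under the single key."""
    return reduce(lambda total, pair: total + pair[1], pairs, 0)


def count_get_requests(rows, n_splits=3):
    """Run the map phase over each split, then reduce the emitted pairs."""
    emitted = [pair
               for split in split_rows(rows, n_splits)
               for pair in map_phase(split)]
    return reduce_phase(emitted)


if __name__ == "__main__":
    # Hypothetical packet rows.
    rows = [
        {"info": "GET /index.html HTTP/1.1"},
        {"info": "49153 > 443 [SYN]"},
        {"info": "Name query NBSTAT"},
        {"info": "GET /logo.png HTTP/1.1"},
        {"info": "POST /login HTTP/1.1"},
    ]
    print(count_get_requests(rows))
```

On a real cluster the mappers run in parallel on different machines, and starting those tasks carries a fixed cost regardless of how small the input is.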
DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or
the Department of Defense.
Figure 2: Screenshot of Wireshark
Figure 1: Screenshot of Apache Hive
Common Analysis Strategies
•Filter by http.request, then look through hosts to find
suspicious traffic.
•Find traffic from all TCP ports except 80 with the SYN flag
set.
•Search for any external NetBIOS traffic.
Results
•The overhead of the virtual machine and of the map and
reduce phases made queries on small datasets slower in
Hive than in Wireshark.
•However, Wireshark could not run queries on datasets
much larger than 5 million packets.
•Even at 5 million packets, Hive was faster than
Wireshark.
•Hive can keep accepting more packets while query time
grows roughly linearly (Figure 4).
•Partitioning improves query performance further
(Figure 3).
•Furthermore, Hive can add cores from commodity
hardware to improve performance further.
Figure 4 (left) compares the performance of Apache Hive to Wireshark as the number of packets processed increases.
Figure 3 (above) is a screenshot demonstrating the performance increase associated with partitioning.
Total packets  Hive load time  Query 1 avg  Query 2 avg  Query 3 avg  Hive query average  Wireshark packets  Wireshark average
0              4.6             54.4666667   45.8666667   33.4333333   44.58888889         0                  0
3679916        19.7            70.3333333   69.2666667   67.05        68.88333333         3679916            65.25
5753004        29.5            93           86.2333333   86.1666667   88.46666667         2073088            39.5
7068211        38.8            92.4666667   90.5         93.1333333   92.03333333         1315207            29
8006273        41.2            107.266667   107.233333   107.233333   107.2444444         938062             19.5
9493847        47.8            115.733333   113.9        117.1        115.5777778         1487574            29.75
9849142        51.8            131.866667   133.5        115.3        126.8888889         355295             8.75
11306882       54.8            144.6        121.033333   150.9        138.8444444         1457740            35.25
13188523       80.3            141.533333   159.3        140.366667   147.0666667         1881641            54.75
16547769       79.2            197.4        182.266667   198.833333   192.8333333         3359246            77
18815952       100.2           218.433333   243.133333   219.8        227.1222222         2268183            51.5
21424373       105.3           265.033333   238.866667   275.6        259.8333333         2608421            71.25
21646375       104.6           245.766667   274.066667   246.866667   255.5666667         222002             3.25
26297364       118.3           335.2        294.8        288.366667   306.1222222         4650989            93
31728102       154.5           334.666667   328.733333   391.65       351.6833333         5430738            106.5
35286631       169.4           318.133333   292.166667   296.233333   302.1777778         3558529            70
Figure 5 (right) is the data used to plot Figure 4.
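One way to check how close the Hive timings are to a straight line is an ordinary least-squares fit of the Hive query average against total packets. The sketch below does this in plain Python using the values from Figure 5. The poster does not state the time unit, so the slope is simply in "reported time units per packet"; the function name linear_fit is illustrative.

```python
# (total packets, Hive query average) pairs copied from Figure 5.
FIGURE_5 = [
    (0, 44.58888889),
    (3679916, 68.88333333),
    (5753004, 88.46666667),
    (7068211, 92.03333333),
    (8006273, 107.2444444),
    (9493847, 115.5777778),
    (9849142, 126.8888889),
    (11306882, 138.8444444),
    (13188523, 147.0666667),
    (16547769, 192.8333333),
    (18815952, 227.1222222),
    (21424373, 259.8333333),
    (21646375, 255.5666667),
    (26297364, 306.1222222),
    (31728102, 351.6833333),
    (35286631, 302.1777778),
]


def linear_fit(points):
    """Ordinary least-squares fit y = slope * x + intercept, with R squared."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    slope, intercept, r_squared = linear_fit(FIGURE_5)
    print(f"slope per million packets: {slope * 1e6:.2f}")
    print(f"intercept: {intercept:.2f}")
    print(f"R squared: {r_squared:.3f}")
```

The intercept gives a rough estimate of the fixed per-query overhead. Note that the final row (35,286,631 packets) reports a lower average than the row before it, so the measured trend is not strictly increasing.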
Queries Used
• Query 1 > SELECT COUNT(*) FROM final WHERE info LIKE '%GET%';
• Query 2 > SELECT COUNT(*) FROM final WHERE srcPort <> 80 AND flag LIKE '%0x0002%';
• Query 3 > SELECT COUNT(*) FROM final WHERE info LIKE '%NBSTAT%';
Figure 6: Illustration of Apache Hadoop
