This document discusses using Apache Hive to analyze large network traffic datasets. It loads packet capture (pcap) files into Hive tables and runs queries to detect malware and intrusion patterns. The research aims to demonstrate Hive's efficiency for querying large datasets and detecting suspicious network activity. Initial results show that while Hive is slower than Wireshark on small datasets, it can efficiently query much larger datasets than Wireshark can handle. Partitioning the data in Hive further improves query speeds.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Building Web APIs on top of SPARQL endpoints is becoming common practice. It enables universal access to the integration-friendly data space of Linked Data. In the majority of use cases, users cannot be expected to learn SPARQL to query this data space, and Web APIs are the most common way to enable programmatic access to data on the Web. However, implementing Web APIs around Linked Data is often a tedious and repetitive process. Recent work speeds up this Linked Data API construction by wrapping APIs around SPARQL queries, which carry out the API functionality under the hood. Inspired by this development, in this paper we present grlc, a lightweight server that takes SPARQL queries curated in GitHub repositories and translates them into Linked Data APIs on the fly.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This involves computational and data-driven approaches that integrate chemistry, exposure, and biological data. The National Center for Computational Toxicology (NCCT) has measured, assembled, and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models, and chemical databases with associated properties. The CompTox Chemicals Dashboard is a web-based application providing access to data associated with ~875,000 chemical substances. New data are continuously added to the database, along with registration of new and emerging chemicals, including data extracted from the literature, identified by our analytical labs, or otherwise of interest to specific agency research projects. By adding these data, with their associated chemical identifiers (names and CAS Registry Numbers), the dashboard uses linking approaches to allow automated searching of PubMed, Google Scholar, and an array of public databases. This presentation will provide an overview of the CompTox Chemicals Dashboard, how it has developed into an integrated data hub for environmental data, and how it can be used for the analysis of emerging chemicals, in terms of sourcing related chemicals of interest and deriving read-across as well as QSAR predictions in real time. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
A research project using Twitter on Canadian Airlines: Topic Modelling, Sentiment Detection and Network Analytics. Presented on April 5th, 2017 in Toronto as part of course DS8006 - Social Media Analytics at Ryerson Masters in Data Science and Analytics.
Presentation held by Lim Ying Sean, Arun Anand Sadanandan, Dickson Lukose and Klaus Tochtermann at the Agricultural Ontology Service (AOS) Workshop 2012 in Kuching, Sarawak, Malaysia, September 3-4, 2012
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Optique aims to provide a semantic end-to-end connection between users and data sources, enabling users to rapidly formulate intuitive queries using familiar vocabularies and conceptualisations, and to receive timely answers from large-scale and heterogeneous data sources.
Presented at SWIB13 in Hamburg, 2013-11-27. ResourceSync slides excerpted from the full tutorial at http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
Big Data Visualization
Kwan-Liu Ma
Professor of Computer Science and Chair of the Graduate Group in Computer Science (GGCS) at the University of California-Davis
January 22nd 2014
We are entering a data-rich era. Advanced computing, imaging, and sensing technologies enable scientists to study natural and physical phenomena at unprecedented precision, resulting in an explosive growth of data. The amount of information collected about Web and mobile device users is expected to be even greater. To make sense of, and maximize the utilization of, such vast amounts of data for knowledge discovery and decision making, we need a new set of tools beyond conventional data mining and statistical analysis. One such tool is visualization. I will present visualizations designed for gleaning insight from massive data and guiding complex data analysis tasks. I will show case studies using data from cyber/homeland security, large-scale scientific simulations, medicine, and sociological studies.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern... (AshishDPatel1)
Sequential pattern mining generates sequential patterns that can be used as the input of another program for retrieving information from a large collection of data. It requires a large amount of memory as well as numerous I/O operations, and multistage operations reduce the efficiency of the algorithm. The given GACP is based on a graph representation and avoids recursively reconstructing intermediate trees during the mining process. The algorithm also eliminates the need to repeatedly scan the database. The graph used in GACP is a data structure accessed starting at its first node, called the root; each node of the graph is either a leaf or an interior node. An interior node has one or more child nodes, so the path from the root to any node in the graph defines a sequence. After construction of the graph, a pruning technique called clustering is used to retrieve records from the graph. The algorithm can mine the database using compact memory-based data structures and clever pruning methods.
Query optimization techniques in Apache Hive (Zara Tariq)
In today's world, data management has become a more complex and essential part of every industry, organization, and individual in order to function with existing and continuously growing data. Organizations of all types are witnessing exponential growth in their massive volumes of data and demand immediate results and readily available metrics, in a cost-effective manner, to support strategic decisions. This paper explores the most effective compression technique for optimizing query processing response time, with a low latency rate, among the techniques recently introduced in Hive, by testing a set of interactive queries on a single big data cluster.
Our results show that applying MapReduce to ORC SerDe formatted data can improve performance with latencies up to 28X lower, which makes this combination the most effective method of optimizing interactive queries on this type of dataset. Conversely, MapReduce is the most efficient independent technique, successfully completing all Hive jobs with an average latency 20X faster than the default Hive execution engine.
Lightning Fast Analytics with Hive LLAP and Druid (DataWorks Summit)
Cox Communications, one of the largest network providers in the U.S., is primarily focused on ensuring network security and providing better service to customers, including:
• Real-time monitoring of IP security traffic to identify and alert on unusual network activities across interfaces within an organization
• Enriching the security team with capabilities to determine the source and destination of traffic, class of service, and the causes of congestion from NetFlow data
Challenges:
Data related to network security includes highly granular streaming data. The major challenge lies in having a unified platform to perform data cleansing, transformation, analytics, and reporting on these huge streaming datasets. With growing network traffic, the associated data grows exponentially, so a scalable framework is needed to handle these datasets and derive useful information from them. Along with data processing, data retrieval also plays a major role in better analysis. Previously, data processing was done in daily batches using manual Python scripts and custom data structures specific to individual use cases. A more generic, unified framework was needed to provide an automated, real-time, end-to-end solution that delivers high-performing, more granular business results.
Solution:
Automating this process has opportunities on several fronts, notably providing consistency, repeatability, and modernization of OLAP analytics on an enterprise big data platform. Reports can be generated more easily and quickly with the underlying OLAP engine.
• A modern big data platform provides the tools and infrastructure to land, cleanse, process, and enrich real-time streaming data using ecosystem components such as Spark, Kafka, and Hive
• Impressively faster OLAP analytics using Hive LLAP and Druid integration
• Simpler and faster reporting using Superset
All of the necessary components are under one roof on the Hortonworks Hadoop platform.
An end-to-end solution built on this big data platform produced faster, repeatable results with sub-second query times.
Value added by the above solution:
• Deliver ultra-fast SQL analytics that can be consumed from the BI tool by the security engineering team to get accelerated business results
• Opportunity for business users to explore and visualize real-time streaming datasets, with integration for various data sources, and to build dashboards for different slices
• Capability to run BI queries in just milliseconds over a 1 TB dataset
• Highly granular permission model on security datasets that allows intricate access rules for the datasets
10 Big Data Technologies You Didn't Know About (Jesus Rodriguez)
This session covers new and exciting big data technologies that are starting to become relevant in the enterprise. The session focuses on technologies that are still not mainstream but that have the potential to influence the next generation of enterprise big data solutions.
Big data serving: Processing and inference at scale in real time (Itai Yaffe)
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ... (Otávio Carvalho)
Work presented in partial fulfillment of the requirements for the degree of Bachelor in Computer Science, Federal University of Rio Grande do Sul, Brazil.
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden (Turkish Testing Board)
Agile, Continuous Integration, DevOps, and big data are no longer buzzwords but part of the day-to-day process of everyone working in software development and delivery. To cope with applications that need to be deployed to production almost the moment they are created, software development has changed, impacting the way of working for everyone on the team. In this talk, Roland will discuss the challenges performance testers face with big data applications and how architecture, Agile, Continuous Integration, and DevOps come together to create solutions.
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop (IJTET Journal)
Abstract— In data mining, association rule mining is one of the major techniques for discovering meaningful patterns from large collections of data. Discovering frequent item sets plays an important role in mining association rules, sequence rules, web log mining, and many other interesting patterns embedded in complex data. Frequent item set mining is one of the classical data mining problems in most data mining applications. Apache Hadoop has been a major innovation in the IT marketplace over the last decade. From modest beginnings, Apache Hadoop has achieved world-wide adoption in data centers. It brings parallel processing into the hands of the average programmer. This paper presents a literature analysis of different techniques for mining frequent item sets and for frequent item set mining on Hadoop.
Hadoop and Internet of Things presentation from the Sinergija 2014 conference, held in Belgrade in October 2014. How rising data resources change business, and how big data technologies combined with Internet of Things devices can help improve business and everyday life. Hadoop is already the most significant technology for working with big data. Microsoft is playing a very important role in this field with the Stinger initiative, whose main goal is to bring enterprise SQL to Hadoop scale.
A whitepaper from Qubole with tips on how to choose the best SQL engine for your use case and data workloads.
https://www.qubole.com/resources/white-papers/enabling-sql-access-to-data-lakes
Storing and Querying Large Network Traffic Datasets with Apache Hive
Mitchell Baller [1], Susan Urban [2]
[1] Department of Computer Science, Case Western Reserve University
[2] Department of Industrial Engineering, Texas Tech University
Texas Tech University 2015 NSF Research Experience for Undergraduates Site Program
Abstract
Today's increasing threats to cybersecurity can be difficult to detect with traditional anti-virus software and intrusion detection systems. Instead of screening individual programs and transmitted data, our goal is to detect malware and other forms of intrusion through network attack behavior patterns. In particular, this research is investigating the use of Apache Hive to store and query large network file dumps. Network file dumps are large lists of pcap (packet captures) that are exported to csv (comma separated value) files and loaded into Hive. HiveQL is then used to query the network traffic and capture valuable statistics that can reveal malicious trends in the network traffic. Queries in HiveQL have been designed to demonstrate existing malware traffic analysis techniques, with performance comparisons to the filtering capabilities of tools such as Wireshark or tshark. The objective of this research is to demonstrate the efficiency of using big data technology such as Apache Hive for analyzing large datasets for suspicious activity.
Methods
• Using IBM InfoSphere BigInsights version 3.
• Running on Windows 7 Enterprise.
• Processor: Intel Core 2 Duo.
• RAM: 8.00 GB, with 5.9 GB devoted to the virtual machine.
• Export pcap file to csv (comma separated value) format.
• Load data into a Hive table (see the sketch after this list).
• Run queries that emulate common analysis strategies.
• Compare speed of Wireshark and Hive.
• Test use of partitioning to more efficiently query data.
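A minimal HiveQL sketch of the load step above. The table name final is taken from the Queries Used section of this poster, but the column layout and file path are assumptions, since the poster does not list the exported csv schema.

-- Hypothetical table matching the csv exported from the pcap file (column names assumed).
CREATE TABLE IF NOT EXISTS final (
  frame_no BIGINT,
  ts       DOUBLE,
  srcIP    STRING,
  dstIP    STRING,
  srcPort  INT,
  dstPort  INT,
  protocol STRING,
  flag     STRING,
  info     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the exported csv from the local filesystem into the table (hypothetical path).
LOAD DATA LOCAL INPATH '/path/to/capture.csv' INTO TABLE final;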
Future Work
• Further explore partitioning and how it increases speed and efficiency.
• Statistical analysis on data attributes to feed into machine learning algorithms to automatically detect threats.
• Investigate other analysis tools.
• Compare machine learning analysis to commercial antivirus and intrusion detection software.
Main Objective
• Use Apache Hive to store large sets of pcap files.
• Partition data to allow for fast and efficient queries.
• Test Hive's capacity to query relevant packets and statistics.
• Compare with other tools commonly used to assess network traffic data, such as Wireshark and tshark.
• Demonstrate the effectiveness of partitioning.
Apache Hive
• A component of the Hadoop framework for processing large data sets.
• Distributed storage through HDFS (Hadoop Distributed File System).
• Distributed processing through MapReduce.
• Converts SQL-like queries into MapReduce code (illustrated in the sketch below).
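As a small illustration of that last point, a sketch only, using the first query from the Queries Used section of this poster: Hive's EXPLAIN statement prints the MapReduce stages a query is compiled into.

-- Show the MapReduce plan Hive generates for a simple filter-and-count query.
EXPLAIN
SELECT COUNT(*) FROM final WHERE info LIKE '%GET%';
-- The plan lists a map stage (table scan plus the LIKE filter)
-- followed by a reduce stage that performs the COUNT aggregation.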
DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
Figure 1: Screenshot of Apache Hive
Figure 2: Screenshot of Wireshark
Common Analysis Strategies
• Filter by http.request, then look through hosts to find suspicious traffic.
• Find traffic from all tcp ports but 80 with the SYN flag set.
• Search for any external NETBIOS traffic.
Results
• The overhead associated with using the virtual machine, as well as mapping and reducing, made queries on small datasets slower in Hive than in Wireshark.
• However, Wireshark could not run queries on datasets much larger than 5 million packets.
• Even at 5 million packets, Hive was faster than Wireshark.
• Hive has the capacity to continue adding packets while maintaining a linear increase in query time (Figure 4).
• Additional performance gains appear when partitioning is implemented (Figure 3; a sketch of the partitioning approach follows below).
• Furthermore, Hive can easily add cores from commodity hardware to improve performance further.
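The poster does not show the partitioning statements it used, so the following HiveQL is only a minimal sketch of the general approach; the partitioned table name (final_part), the partition column (protocol), and all column names other than srcPort, flag, and info are assumptions.

-- Hypothetical partitioned copy of the packet table, partitioned by protocol.
CREATE TABLE IF NOT EXISTS final_part (
  srcIP   STRING,
  dstIP   STRING,
  srcPort INT,
  dstPort INT,
  flag    STRING,
  info    STRING
)
PARTITIONED BY (protocol STRING)
STORED AS TEXTFILE;

-- Populate the partitions dynamically from the unpartitioned table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE final_part PARTITION (protocol)
SELECT srcIP, dstIP, srcPort, dstPort, flag, info, protocol FROM final;

-- A query that filters on the partition column reads only the matching partition
-- instead of scanning the entire table, which is where the speedup comes from.
SELECT COUNT(*) FROM final_part
WHERE protocol = 'TCP' AND srcPort <> 80 AND flag LIKE '%0x0002%';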
Figure 4 (left) compares the performance of Apache Hive to Wireshark as the number of packets processed increases.
Figure 3 (above) is a screenshot demonstrating the performance increase associated with partitioning.
Figure 5 (right) is the data used to plot Figure 4:

total packets | Hive Load Time | Query 1 avg | Query 2 avg | Query 3 avg | Hive Query Average | Wireshark Packets | Wireshark Average
0 | 4.6 | 54.4666667 | 45.8666667 | 33.4333333 | 44.58888889 | 0 | 0
3679916 | 19.7 | 70.3333333 | 69.2666667 | 67.05 | 68.88333333 | 3679916 | 65.25
5753004 | 29.5 | 93 | 86.2333333 | 86.1666667 | 88.46666667 | 2073088 | 39.5
7068211 | 38.8 | 92.4666667 | 90.5 | 93.1333333 | 92.03333333 | 1315207 | 29
8006273 | 41.2 | 107.266667 | 107.233333 | 107.233333 | 107.2444444 | 938062 | 19.5
9493847 | 47.8 | 115.733333 | 113.9 | 117.1 | 115.5777778 | 1487574 | 29.75
9849142 | 51.8 | 131.866667 | 133.5 | 115.3 | 126.8888889 | 355295 | 8.75
11306882 | 54.8 | 144.6 | 121.033333 | 150.9 | 138.8444444 | 1457740 | 35.25
13188523 | 80.3 | 141.533333 | 159.3 | 140.366667 | 147.0666667 | 1881641 | 54.75
16547769 | 79.2 | 197.4 | 182.266667 | 198.833333 | 192.8333333 | 3359246 | 77
18815952 | 100.2 | 218.433333 | 243.133333 | 219.8 | 227.1222222 | 2268183 | 51.5
21424373 | 105.3 | 265.033333 | 238.866667 | 275.6 | 259.8333333 | 2608421 | 71.25
21646375 | 104.6 | 245.766667 | 274.066667 | 246.866667 | 255.5666667 | 222002 | 3.25
26297364 | 118.3 | 335.2 | 294.8 | 288.366667 | 306.1222222 | 4650989 | 93
31728102 | 154.5 | 334.666667 | 328.733333 | 391.65 | 351.6833333 | 5430738 | 106.5
35286631 | 169.4 | 318.133333 | 292.166667 | 296.233333 | 302.1777778 | 3558529 | 70
Queries Used
• Query 1: SELECT COUNT(*) FROM final WHERE info LIKE '%GET%';  (extended in the sketch below)
• Query 2: SELECT COUNT(*) FROM final WHERE srcPort <> 80 AND flag LIKE '%0x0002%';
• Query 3: SELECT COUNT(*) FROM final WHERE info LIKE '%NBSTAT%';
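Along the lines of the first analysis strategy (filter HTTP requests, then look through hosts), Query 1 could be extended as sketched below; the dstIP column is an assumption carried over from the Methods sketch, not something shown on the poster.

-- Rank destination hosts by the number of HTTP GET requests observed,
-- so that hosts receiving suspicious volumes of traffic stand out.
SELECT dstIP, COUNT(*) AS get_requests
FROM final
WHERE info LIKE '%GET%'
GROUP BY dstIP
ORDER BY get_requests DESC
LIMIT 20;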
Figure 6: Illustration of Apache Hadoop