“Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark”
Written by Ilias Mavridis & Eleni Karatza
Presented by Kishor Datta Gupta
Log files
• Every action produces logs.
• Log production can reach several TB per day; Facebook had 300 PB of data in 2014.
• Conventional databases cannot be used for analysis at this scale.
So the cloud can help analyze log data.
Cloud problems with log files
• Logs from a number of components need to be correlated first.
• Missing, duplicate, and misleading data make log analysis more complex.
• Analytical and statistical modeling techniques do not always provide actionable insights.
This paper tries to answer which cloud computing techniques are suitable for log analysis.
Experimental Setup
6 virtual machines connected through a private local network
KVM hypervisor; CPU clock speed 2100 MHz
1 master and 5 slave nodes
Master node: 8 virtual cores, 8 GB memory, 40 GB disk
Each slave node: 4 virtual cores, 6 GB memory, 40 GB disk
Operating system: Debian 8.2
JDK 1.7.0
Dataset
Apache HTTP server log file
Semi-structured text data
Originally 1 GB in size
Copied and merged to create larger inputs for the experiments
Data stored in HDFS in 128 MB blocks
Imported as a table in Hive using a regular expression (see the sketch below)
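Below is a minimal Spark (Scala) sketch of such a regex-based import. The pattern targets the Apache Common Log Format, and the HDFS path and field names are illustrative assumptions; the paper itself loads the data into a Hive table, whose exact regular expression is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// A minimal sketch of a regex-based import of Apache access logs.
// The pattern targets the Common Log Format and is an assumption;
// the paper's exact Hive regex is not shown in the slides.
object LogImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-import").getOrCreate()

    // host - - [time] "method resource protocol" status bytes
    val pattern = """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)"""

    // The HDFS path is hypothetical; the dataset sits in HDFS in 128 MB blocks.
    val logs = spark.read.text("hdfs:///logs/access_log").select(
      regexp_extract(col("value"), pattern, 1).as("host"),
      regexp_extract(col("value"), pattern, 4).as("time"),
      regexp_extract(col("value"), pattern, 6).as("resource"),
      regexp_extract(col("value"), pattern, 8).cast("int").as("status"),
      regexp_extract(col("value"), pattern, 9).as("bytes")
    )

    logs.createOrReplaceTempView("access_logs") // queryable like the Hive table
    logs.show(5, truncate = false)
    spark.stop()
  }
}
```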
Evaluation
Tools
Open-source Ganglia
Lightweight
Supports both Hadoop and Spark
Provides HDFS and YARN metrics
Experimental Scenarios (6 Applications)
Count the requests to the web server and sort them by day (sketched below).
Find the 10 most popular requests and show them in descending order (also sketched below).
Search for an abnormally large number of identical requests and display the time and the number of requests.
Examine the moment of a DoS attack to list suspicious IP addresses.
Search for error requests and count them by type.
Find the requested resources that produce the majority of errors.
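To make the first two scenarios concrete, here is an illustrative Spark (Scala) sketch that counts requests per day and lists the 10 most requested resources. It assumes the access_logs view registered in the import sketch above (within the same session); it is not the authors' actual application code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, substring}

// Illustrative versions of scenarios 1 and 2, assuming the access_logs view
// created in the import sketch. Not the authors' actual applications.
object LogScenarios {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-scenarios").getOrCreate()
    val logs = spark.table("access_logs")

    // Scenario 1: requests to the web server, counted and sorted by day.
    // Common Log Format timestamps look like "10/Oct/2014:13:55:36 -0700",
    // so the first 11 characters identify the day.
    val requestsPerDay = logs
      .withColumn("day", substring(col("time"), 1, 11))
      .groupBy("day")
      .count()
      .orderBy("day")

    // Scenario 2: the 10 most requested resources, in descending order.
    val topRequests = logs
      .groupBy("resource")
      .count()
      .orderBy(desc("count"))
      .limit(10)

    requestsPerDay.show()
    topRequests.show(truncate = false)
    spark.stop()
  }
}
```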
Execution time of Hadoop
Requests per Day vs. DDoS Detection
Mean Execution Time
Execution Time of Spark
Power Consumption Model
Findings
CPU utilization and power consumption show no clear relationship.
Ethernet link data transfer and power consumption are not proportional.
Decision
For execution time, Spark is better.
For saving money, Hadoop is better.
Hive does not boost performance much.
Spark SQL reduces performance.
Hadoop and Spark can coexist.
Increasing the number of nodes beyond a certain point yields no further speedup.
Future work
More complex log analysis is needed.
A physical wattmeter is needed to validate the power model with more results.
Thank you!
Questions & Answers
