“Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark”
Written by Ilias Mavridis & Eleni Karatza
Presented by Kishor Datta Gupta
Log files
• Every action produces logs.
• Log production can reach several TB per day; Facebook had 300 PB of data in 2014.
• Conventional databases cannot be used for analysis at this scale.
So the cloud can help analyze log data.
Cloud problems with log files
• Logs from a number of components need to be correlated first.
• Missing, duplicate, and misleading data make log analysis more complex.
• Analytical and statistical modeling techniques do not always provide actionable insights.
This paper tries to answer which cloud computing techniques are suitable for log analysis.
Experimental Setup
6 virtual machines connected through a private local network
KVM hypervisor; CPU clock speed 2100 MHz
1 master and 5 slave nodes
Master node: 8 virtual cores, 8 GB memory, 40 GB disk
Each slave node: 4 virtual cores, 6 GB memory, 40 GB disk
Operating system: Debian 8.2
JDK 1.7.0
Dataset
Apache HTTP server log file
Semi-structured text data
Originally 1 GB in size
Copied and merged to create larger inputs for the experiments
Data stored in HDFS in 128 MB blocks
Imported as a table in Hive using a regular expression (see the sketch below)
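Below is a minimal Spark (Scala) sketch of such a regex-based import. The pattern targets the Apache Common Log Format, and the HDFS path and field names are illustrative assumptions; the paper itself loads the data into a Hive table, whose exact regular expression is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// A minimal sketch of a regex-based import of Apache access logs.
// The pattern targets the Common Log Format and is an assumption;
// the paper's exact Hive regex is not shown in the slides.
object LogImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-import").getOrCreate()

    // host - - [time] "method resource protocol" status bytes
    val pattern = """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)"""

    // The HDFS path is hypothetical; the dataset sits in HDFS in 128 MB blocks.
    val logs = spark.read.text("hdfs:///logs/access_log").select(
      regexp_extract(col("value"), pattern, 1).as("host"),
      regexp_extract(col("value"), pattern, 4).as("time"),
      regexp_extract(col("value"), pattern, 6).as("resource"),
      regexp_extract(col("value"), pattern, 8).cast("int").as("status"),
      regexp_extract(col("value"), pattern, 9).as("bytes")
    )

    logs.createOrReplaceTempView("access_logs") // queryable like the Hive table
    logs.show(5, truncate = false)
    spark.stop()
  }
}
```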
Evaluation
Tools
Open-source Ganglia
Lightweight
Supports both Hadoop and Spark
Provides HDFS and YARN metrics
Experimental Scenarios (6 Applications)
Count the requests to the web server and sort them by day (sketched below).
Find the 10 most popular requests and show them in descending order (also sketched below).
Search for an abnormally large number of identical requests and display the time and the number of requests.
Examine the moment of a DoS attack to list suspicious IP addresses.
Search for error requests and count them by type.
Find the requested resources that produce the majority of errors.
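To make the first two scenarios concrete, here is an illustrative Spark (Scala) sketch that counts requests per day and lists the 10 most requested resources. It assumes the access_logs view registered in the import sketch above (within the same session); it is not the authors' actual application code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, substring}

// Illustrative versions of scenarios 1 and 2, assuming the access_logs view
// created in the import sketch. Not the authors' actual applications.
object LogScenarios {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-scenarios").getOrCreate()
    val logs = spark.table("access_logs")

    // Scenario 1: requests to the web server, counted and sorted by day.
    // Common Log Format timestamps look like "10/Oct/2014:13:55:36 -0700",
    // so the first 11 characters identify the day.
    val requestsPerDay = logs
      .withColumn("day", substring(col("time"), 1, 11))
      .groupBy("day")
      .count()
      .orderBy("day")

    // Scenario 2: the 10 most requested resources, in descending order.
    val topRequests = logs
      .groupBy("resource")
      .count()
      .orderBy(desc("count"))
      .limit(10)

    requestsPerDay.show()
    topRequests.show(truncate = false)
    spark.stop()
  }
}
```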
Execution time of Hadoop
Requests per Day vs. DDoS Detection
Mean Execution Time
Execution Time of Spark
Power Consumption Model
Findings
CPU utilization and power consumption show no clear relationship.
Ethernet link data transfer and power consumption are not proportional.
Decision
For execution time, Spark is better.
For saving money, Hadoop is better.
Hive does not boost performance much.
Spark SQL reduces performance.
Hadoop and Spark can coexist.
Increasing the number of nodes beyond a certain point yields no further speedup.
Future work
More complex log analysis is needed.
A physical wattmeter is needed to validate the power model with more results.
Thank you!
Questions & Answers
