Performance Evaluation of Cloudera Impala (with Comparison to Hive)
Presentation Transcript

  • Cloudera Impala Performance Evaluation (with Comparison to Hive)
    Dec. 8, 2012
    CELLANT Corp. R&D Strategy Division
    Yukinori SUDA (@sudabon)
  • About Cloudera Impala
    •  Latest version is 0.3 beta
    •  Open-source implementation inspired by Google Dremel and F1
    •  Developed by Cloudera, a well-known Hadoop distributor
    •  Brings real-time, ad hoc query capability to Apache Hadoop
    •  Queries data stored in HDFS or Apache HBase
    •  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
    •  Supports TextFile and SequenceFile as Hive storage formats
    •  Also supports SequenceFile compressed with Snappy, Gzip and Bzip
    •  Accesses the data directly through a specialized distributed query engine
  • Architecture
    •  The State Store runs as the impala-state-store (statestored) daemon
    •  The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon
  • System Environment
    •  Installed via Cloudera Manager Free Edition
    •  Master (1 server)
       o  HDFS: NameNode, SecondaryNameNode
       o  MapReduceV1: JobTracker
       o  impala: impalad, impala-state-store (statestored)
    •  Slave (13 servers)
       o  HDFS: DataNode
       o  MapReduceV1: TaskTracker
       o  impala: impalad
    •  All servers are connected with 1Gbps Ethernet through an L2 switch
  • Server Specification
    •  CPU
       o  Intel Core 2 Duo 2.13 GHz with Hyper-Threading
    •  Memory
       o  4GB
    •  Disk
       o  7,200 rpm SATA mechanical hard disk drive
    •  OS
       o  CentOS 6.2
  • Benchmark
    •  Used CDH4.1 with Impala versions 0.2 and 0.3
    •  Used hivebench from the open-source benchmark tool “HiBench”
       o  https://github.com/hibench
    •  Reduced the datasets to 1/10 scale
       o  The default configuration generates a table with 1 billion rows
    •  Modified the query
       o  Removed “INSERT INTO TABLE …” to evaluate read-only performance
       o  Removed the “datediff” function (I mistakenly assumed it was not supported)
    •  Combined several Hive storage formats with several compression methods
       o  TextFile, SequenceFile, RCFile
       o  No compression, Gzip, Snappy
    •  Compared by query job latency
       o  Average job latency over 5 measurements
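The measurement procedure on this slide (every storage format and compression pairing, averaged over 5 runs) can be sketched roughly as below. This is only an illustration under stated assumptions: the function names `average_latency` and `benchmark_matrix`, and the `make_runner` factory that would actually submit the query to Hive or Impala, are hypothetical and do not come from the slides or from HiBench. The slides also do not say that every pairing was measured, so the full cross product is an assumption too.

```python
import itertools
import statistics
import time

# Storage formats and compression methods named on the slide.
STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["none", "Gzip", "Snappy"]
RUNS = 5  # the slide reports the average over 5 measurements


def average_latency(run_query, runs=RUNS):
    """Return the mean wall-clock seconds of `runs` calls to run_query()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark_matrix(make_runner):
    """Average latency for every (format, compression) pair.

    make_runner(fmt, codec) is a hypothetical factory that returns a
    zero-argument callable executing the query against the matching table.
    """
    return {
        (fmt, codec): average_latency(make_runner(fmt, codec))
        for fmt, codec in itertools.product(STORAGE_FORMATS, COMPRESSIONS)
    }
```

In practice `make_runner` would wrap whatever client submits the query; an engine that does not support a given format would simply be skipped for that pair.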
  • Modified Datasets
    •  Uservisits table
       o  100 million rows
       o  Schema
          •  sourceIP string
          •  destURL string
          •  visitDate string
          •  adRevenue double
          •  userAgent string
          •  countryCode string
          •  languageCode string
          •  searchWord string
          •  duration int
    •  Rankings table
       o  12 million rows
       o  Schema
          •  pageURL string
          •  pageRank int
          •  avgDuration int
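As a rough illustration of how the two schemas above map onto Hive table definitions, the sketch below builds `CREATE TABLE` statements from the column lists. The deck does not show the actual DDL that was used, so the helper name `create_table_ddl`, the lowercase table names, and the default `STORED AS SEQUENCEFILE` clause are assumptions made for this example.

```python
# Column lists copied from the slide.
RANKINGS = [
    ("pageURL", "string"),
    ("pageRank", "int"),
    ("avgDuration", "int"),
]
USERVISITS = [
    ("sourceIP", "string"),
    ("destURL", "string"),
    ("visitDate", "string"),
    ("adRevenue", "double"),
    ("userAgent", "string"),
    ("countryCode", "string"),
    ("languageCode", "string"),
    ("searchWord", "string"),
    ("duration", "int"),
]


def create_table_ddl(name, columns, stored_as="SEQUENCEFILE"):
    """Build a HiveQL CREATE TABLE statement (hypothetical helper)."""
    cols = ",\n".join(f"  {col} {typ.upper()}" for col, typ in columns)
    return f"CREATE TABLE {name} (\n{cols}\n)\nSTORED AS {stored_as}"


if __name__ == "__main__":
    print(create_table_ddl("rankings", RANKINGS))
    print(create_table_ddl("uservisits", USERVISITS, stored_as="TEXTFILE"))
```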
  • Modified Query
        SELECT sourceIP,
               sum(adRevenue) as totalRevenue,
               avg(pageRank)
        FROM rankings R
        JOIN (
            SELECT sourceIP, destURL, adRevenue
            FROM uservisits UV
            WHERE UV.visitDate >= '1999-01-01'
              AND UV.visitDate <= '2001-01-01'
        ) NUV
        ON (R.pageURL = NUV.destURL)
        GROUP BY sourceIP
        ORDER BY totalRevenue DESC
        LIMIT 1
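To make clear what the benchmark query computes, here is a small in-memory sketch of the same logic in plain Python: filter visits by date, join them to rankings on the URL, aggregate per source IP, and keep the row with the highest total revenue. It is not how Hive or Impala execute the query; the function name `top_revenue_source` and the simplified tuple layouts are assumptions for illustration only.

```python
from collections import defaultdict


def top_revenue_source(rankings, uservisits,
                       start="1999-01-01", end="2001-01-01"):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top source IP.

    rankings:   iterable of (pageURL, pageRank)
    uservisits: iterable of (sourceIP, destURL, visitDate, adRevenue)
    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    ranks_by_url = defaultdict(list)
    for page_url, page_rank in rankings:
        ranks_by_url[page_url].append(page_rank)

    # Per source IP: [revenue sum, pageRank sum, joined row count]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for source_ip, dest_url, visit_date, ad_revenue in uservisits:
        if not (start <= visit_date <= end):
            continue  # WHERE clause of the subquery
        for page_rank in ranks_by_url.get(dest_url, []):  # inner join
            acc = totals[source_ip]
            acc[0] += ad_revenue
            acc[1] += page_rank
            acc[2] += 1

    if not totals:
        return None
    # ORDER BY totalRevenue DESC LIMIT 1
    ip, (revenue, rank_sum, rows) = max(totals.items(),
                                        key=lambda kv: kv[1][0])
    return ip, revenue, rank_sum / rows
```

For example, with two ranked pages and a handful of visits, only visits dated between 1999-01-01 and 2001-01-01 contribute to each IP's total.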
  • Benchmark Result (Hive)
  • Benchmark Result (Impala 0.2)
  • Benchmark Result (Impala 0.3)
  • Conclusion
    •  Impala is over 10 times faster than MapReduce + Hive
       o  Impala 0.3
          •  SequenceFile compressed with Snappy: 14.337 seconds
       o  Impala 0.2
          •  SequenceFile compressed with Gzip: 19.733 seconds
       o  Hive
          •  RCFile compressed with Snappy: 164.161 seconds
    •  Hopefully Impala version 1.0, included in CDH5, will be even faster
       o  Support for RCFile and the Trevni columnar format
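The speedup implied by the three latencies quoted on this slide can be checked with a few lines of arithmetic. The constant and function names below are made up for this sketch; the numbers are the ones from the slide. On these figures Impala 0.3 comes out at roughly 11.5 times faster than Hive, and Impala 0.2 at roughly 8.3 times.

```python
# Latencies in seconds, as quoted on the Conclusion slide.
HIVE_SECONDS = 164.161        # RCFile + Snappy
IMPALA_02_SECONDS = 19.733    # SequenceFile + Gzip
IMPALA_03_SECONDS = 14.337    # SequenceFile + Snappy


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    print(f"Impala 0.3 vs Hive: {speedup(HIVE_SECONDS, IMPALA_03_SECONDS):.2f}x")
    print(f"Impala 0.2 vs Hive: {speedup(HIVE_SECONDS, IMPALA_02_SECONDS):.2f}x")
```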
  • Thank you