Performance Evaluation of Cloudera Impala (with Comparison to Hive)


Published in: Technology

  1. Cloudera Impala Performance Evaluation (with Comparison to Hive)
     Dec. 8, 2012
     CELLANT Corp. R&D Strategy Division
     Yukinori SUDA (@sudabon)
  2. About Cloudera Impala
     •  Latest version is 0.3 beta
     •  Open-source implementation inspired by Google Dremel and F1
     •  Developed by Cloudera, a well-known Hadoop distributor
     •  Brings real-time, ad hoc query capability to Apache Hadoop
     •  Queries data stored in HDFS or Apache HBase
     •  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
     •  Supports TextFile and SequenceFile as Hive storage formats
     •  Also supports SequenceFile compressed with Snappy, Gzip and Bzip
     •  Accesses the data directly through a specialized distributed query engine
  3. Architecture
     •  The State Store runs as the impala-state-store (statestored) daemon
     •  The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon
  4. System Environment
     •  Installed via Cloudera Manager Free Edition
     •  Master (1 server)
        o  HDFS: NameNode, SecondaryNameNode
        o  MapReduceV1: JobTracker
        o  Impala: impalad, impala-state-store (statestored)
     •  Slave (13 servers)
        o  HDFS: DataNode
        o  MapReduceV1: TaskTracker
        o  Impala: impalad
     •  All servers are connected with 1 Gbps Ethernet through an L2 switch
  5. Server Specification
     •  CPU
        o  Intel Core 2 Duo 2.13 GHz with Hyper-Threading
     •  Memory
        o  4 GB
     •  Disk
        o  7,200 rpm SATA mechanical hard disk drive
     •  OS
        o  CentOS 6.2
  6. Benchmark
     •  Used CDH4.1 + Impala versions 0.2 and 0.3
     •  Used hivebench from the open-source benchmark tool "HiBench"
     •  Reduced the datasets to 1/10 scale
        o  The default configuration generates a table with 1 billion rows
     •  Modified the query statement
        o  Removed "INSERT INTO TABLE …" to evaluate read-only performance
        o  Removed the "datediff" function (I mistakenly believed it was not supported)
     •  Combined several Hive storage formats with several compression methods
        o  TextFile, SequenceFile, RCFile
        o  No compression, Gzip, Snappy
     •  Compared query job latency
        o  Average job latency over 5 measurements
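The measurement procedure on this slide (average latency over 5 runs, repeated per storage format and compression method) can be sketched roughly as follows. This is a minimal sketch, not the harness used for the benchmark: `average_latency`, `benchmark_grid` and `make_runner` are hypothetical names, the caller has to supply the function that actually submits the query to Hive or Impala, and running the full 3 x 3 grid is an assumption, since the slides do not say which combinations were run on which engine.

```python
import itertools
import statistics
import time

# Formats and codecs named on the slide; the full cross product is an assumption.
STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["none", "Gzip", "Snappy"]
RUNS = 5  # "average job latency over 5 measurements"


def average_latency(run_query, runs=RUNS):
    """Return the mean wall-clock seconds over `runs` calls to run_query()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark_grid(make_runner):
    """Measure every (format, compression) pair.

    make_runner(fmt, codec) is a hypothetical factory that returns a
    zero-argument callable executing the query against that table variant.
    """
    results = {}
    for fmt, codec in itertools.product(STORAGE_FORMATS, COMPRESSIONS):
        results[(fmt, codec)] = average_latency(make_runner(fmt, codec))
    return results
```

Passing the query runner in as a callable keeps the timing logic independent of how the query is submitted, so the same loop could be reused for both engines.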
  7. Modified Datasets
     •  Uservisits table
        o  100 million rows
        o  Schema
           •  sourceIP string
           •  destURL string
           •  visitDate string
           •  adRevenue double
           •  userAgent string
           •  countryCode string
           •  languageCode string
           •  searchWord string
           •  duration int
     •  Rankings table
        o  12 million rows
        o  Schema
           •  pageURL string
           •  pageRank int
           •  avgDuration int
  8. Modified Query
     SELECT
       sourceIP,
       sum(adRevenue) as totalRevenue,
       avg(pageRank)
     FROM
       rankings R
     JOIN (
       SELECT sourceIP, destURL, adRevenue
       FROM uservisits UV
       WHERE UV.visitDate >= '1999-01-01'
       AND UV.visitDate <= '2001-01-01'
     ) NUV
     ON (R.pageURL = NUV.destURL)
     GROUP BY sourceIP
     ORDER BY totalRevenue DESC
     LIMIT 1
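To make clear what this query computes, here is a small sketch that runs the same join on a few made-up rows. It uses Python's built-in SQLite as a stand-in for Hive and Impala (an assumption for illustration only, so it says nothing about performance). Column names follow the schema on the Modified Datasets slide, the sample rows are invented, and `run_toy_join` is a hypothetical helper name.

```python
import sqlite3

# Same shape as the benchmark query: filter uservisits by date, join to
# rankings on the URL, then report the sourceIP with the highest revenue.
QUERY = """
SELECT sourceIP,
       sum(adRevenue) AS totalRevenue,
       avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01'
      AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_toy_join():
    """Run the query on an in-memory SQLite database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.example", 10, 5), ("b.example", 30, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.example", "1999-06-01", 1.5),
            ("10.0.0.1", "b.example", "2000-03-15", 2.5),
            ("10.0.0.2", "a.example", "2000-01-01", 3.0),
            ("10.0.0.2", "b.example", "2005-01-01", 9.0),  # outside the date window
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row
```

With these rows, `run_toy_join()` returns `('10.0.0.1', 4.0, 20.0)`: the visit dated 2005 is filtered out before the join, so 10.0.0.1 has the highest total revenue.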
  9. Benchmark Result (Hive)
  10. Benchmark Result (Impala 0.2)
  11. Benchmark Result (Impala 0.3)
  12. Conclusion
     •  Impala is over 10 times faster than MapReduce + Hive
        o  Impala 0.3
           •  SequenceFile compressed with Snappy: 14.337 seconds
        o  Impala 0.2
           •  SequenceFile compressed with Gzip: 19.733 seconds
        o  Hive
           •  RCFile compressed with Snappy: 164.161 seconds
     •  Hopefully Impala 1.0, to be included in CDH5, will be even faster
        o  Support for RCFile and the Trevni columnar format
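The speedup behind the "over 10 times faster" statement can be checked directly from the three latencies quoted on this slide. A minimal sketch follows; the dictionary keys and the `speedup_over_hive` helper are illustrative names, and only the numbers come from the slide.

```python
# Latencies in seconds, as quoted on the Conclusion slide.
BEST_LATENCY_SECONDS = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup_over_hive(latencies, baseline="Hive (RCFile + Snappy)"):
    """Return how many times faster each entry is than the baseline entry."""
    base = latencies[baseline]
    return {
        name: base / seconds
        for name, seconds in latencies.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, ratio in speedup_over_hive(BEST_LATENCY_SECONDS).items():
        print(f"{name}: {ratio:.2f}x faster than Hive")
```

On these figures Impala 0.3 comes out at roughly 11.45 times faster than Hive, and Impala 0.2 at roughly 8.32 times.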
  13. Thank you