Performance Evaluation of Cloudera Impala (with Comparison to Hive)

Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

About Cloudera Impala

•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip, and Bzip
•  Accesses the data directly through a specialized distributed query engine

Architecture

•  The State Store runs as the impala-state-store (statestored) daemon
•  The Query Planner, Query Coordinator, and Query Exec Engine run inside the impalad daemon

System Environment

•  Installed via Cloudera Manager Free Edition

Master (1 server)
・HDFS: NameNode, SecondaryNameNode
・MapReduceV1: JobTracker
・impala: impalad, impala-state-store (statestored)

Slave (13 servers)
・HDFS: DataNode
・MapReduceV1: TaskTracker
・impala: impalad

All servers are connected with 1 Gbps Ethernet through an L2 switch

Server Specification

•  CPU
o  Intel Core 2 Duo 2.13 GHz with Hyper Threading

•  Memory
o  4GB

•  Disk
o  7,200 rpm SATA mechanical Hard Disk Drive

•  OS
o  CentOS 6.2

Benchmark

•  Used CDH4.1 + Impala versions 0.2 and 0.3
•  Used hivebench from the open-source benchmark tool "HiBench"
o  https://github.com/hibench
•  Scaled the datasets down to 1/10
o  The default configuration generates a table with 1 billion rows
•  Modified the query
o  Removed "INSERT INTO TABLE …" to evaluate read-only performance
o  Removed the "datediff" function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several compression methods
o  TextFile, SequenceFile, RCFile
o  No compression, Gzip, Snappy
•  Compared query latency
o  Average latency over 5 measurements
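
The measurement procedure above (every storage format combined with every compression method, each averaged over 5 runs) can be sketched in Python roughly as follows. This is only an illustrative harness, not the script used for the slides: `average_latency`, `benchmark_matrix`, and the `make_runner` callback are hypothetical names, and the caller is assumed to supply a function that actually submits the query to Hive or Impala. Not every combination is necessarily valid for every engine.

```python
import itertools
import statistics
import time

# Storage formats and compression methods listed on the slide.
STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["None", "Gzip", "Snappy"]


def average_latency(run_query, repetitions=5):
    """Call run_query() `repetitions` times and return the mean wall-clock seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark_matrix(make_runner, repetitions=5):
    """Measure every (format, compression) pair.

    make_runner(fmt, codec) is a hypothetical factory that returns a
    zero-argument callable which runs the query against the matching table.
    """
    results = {}
    for fmt, codec in itertools.product(STORAGE_FORMATS, COMPRESSIONS):
        results[(fmt, codec)] = average_latency(make_runner(fmt, codec), repetitions)
    return results
```

In practice `make_runner` might wrap a call to the `hive` or `impala-shell` command line, but that wiring is left out here because it depends on the cluster setup.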

Modified Datasets

•  Uservisits table
o  100 million rows
o  Schema
•  sourceIP string
•  destURL string
•  visitDate string
•  adRevenue double
•  userAgent string
•  countryCode string
•  languageCode string
•  searchWord string
•  duration int

•  Rankings table
o  12 million rows
o  Schema
•  pageURL string
•  pageRank int
•  avgDuration int

Modified Query

SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits UV
  WHERE
    UV.visitDate >= '1999-01-01'
    AND UV.visitDate <= '2001-01-01'
) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
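
To make the semantics of this query concrete, the sketch below runs it against a few made-up rows using Python's built-in `sqlite3` module. SQLite stands in for Hive/Impala purely for illustration; the sample rows, the `run_toy_example` helper, and the reduced column set of `uservisits` are assumptions, not part of the benchmark.

```python
import sqlite3

# The modified benchmark query: join, filter on visitDate, aggregate per
# sourceIP, and keep only the IP with the highest total ad revenue.
QUERY = """
SELECT sourceIP,
       sum(adRevenue) AS totalRevenue,
       avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01'
      AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_toy_example():
    """Run the query on a tiny in-memory dataset (hypothetical sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.example", 10, 5), ("b.example", 30, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.example", "2000-05-01", 1.5),
            ("10.0.0.1", "b.example", "2000-06-01", 2.0),
            ("10.0.0.2", "a.example", "2000-07-01", 3.0),
            ("10.0.0.2", "b.example", "2005-01-01", 9.0),  # outside the date range
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row
```

Because `visitDate` is stored as a string, the date filter is a plain lexicographic comparison, which works here only because the values use the `YYYY-MM-DD` layout.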

Benchmark Result (Hive)

Benchmark Result (Impala 0.2)

Benchmark Result (Impala 0.3)

Conclusion

•  Impala is over 10 times faster than MapReduce + Hive
o  Impala 0.3
•  SequenceFile compressed with Snappy: 14.337 seconds
o  Impala 0.2
•  SequenceFile compressed with Gzip: 19.733 seconds
o  Hive
•  RCFile compressed with Snappy: 164.161 seconds

•  Hopefully Impala 1.0, to be included in CDH5, will be even faster
o  Support for RCFile and the Trevni columnar format
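
The speedup behind the "over 10 times" claim can be checked directly from the three latencies quoted above. The short sketch below does the arithmetic; the dictionary keys and the `speedup` helper are illustrative names only.

```python
# Latencies in seconds, as reported in the conclusion.
LATENCY_SECONDS = {
    "Hive (RCFile + Snappy)": 164.161,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
}


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


hive = LATENCY_SECONDS["Hive (RCFile + Snappy)"]
for name, seconds in LATENCY_SECONDS.items():
    if name.startswith("Impala"):
        print(f"{name}: {speedup(hive, seconds):.2f}x faster than Hive")
```

With these figures, Impala 0.3 comes out at roughly 11.5 times faster than Hive and Impala 0.2 at roughly 8.3 times, so the "over 10 times" statement applies to version 0.3.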
