Performance Evaluation of Cloudera Impala 0.6 beta with Comparison to Hive

Cloudera Impala 0.6 beta
Performance Evaluation
(with Comparison to Hive)

Mar. 6, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

Cloudera Impala 0.6 beta

•  Changes since 0.5 beta
  –  Cloudera Manager 4.5 and CDH 4.2 support Impala 0.6.
  –  Support for the RCFile file format.
  –  Added support for Impala on SUSE and Debian/Ubuntu.
    •  RHEL 5.7/6.2 and CentOS 5.7/6.2
    •  SUSE 11 with Service Pack 1 or later
    •  Ubuntu 10.04/12.04 and Debian 6.03


System Environment
v  Install via Cloudera Manager Free Edition 4.5.0

Master Slave

DataNode DataNode DataNode DataNode
Active
TaskTracker TaskTracker TaskTracker TaskTracker
NameNode
Impalad Impalad Impalad Impalad

DataNode DataNode DataNode DataNode
Stand-‐‑‒by
TaskTracker TaskTracker TaskTracker TaskTracker
NameNode
Impalad Impalad Impalad Impalad

DataNode
JobTracker DataNode DataNode
TaskTracker
statestored TaskTracker TaskTracker
Impalad
Impalad Impalad

3 Servers 11 Servers

All servers are connected with 1Gbps Ethernet through an L2 switch

Server Specification

•  CPU
  –  Intel Core 2 Duo 2.13 GHz with Hyper Threading
•  Memory
  –  4GB
•  Disk
  –  7,200 rpm SATA mechanical Hard Disk Drive
•  OS
  –  CentOS 6.2


Benchmark

v  Use CDH4.2.0 + impala version 0.6 beta
v  Use hivebench in open-‐‑‒sourced benchmark tool “HiBench”
l  https://github.com/hibench
v  Modified datasets to 1/10 scale
l  Default configuration generates table with 1 billion rows
v  Modified query sentence
l  Deleted “INSERT INTO TABLE …” to evaluate read-‐‑‒only performance
v  Combines a few Hive storage format with a few compression
method
l  TextFile, SequenceFile, RCFile
l  No compression, Gzip, Snappy
v  Comparison with job query latency
v  Average job latency over 5 measurements


Modified Datasets

•  Uservisits table
  –  100 million rows
  –  Table Definitions
    •  sourceIP string
    •  destURL string
    •  visitDate string
    •  adRevenue double
    •  userAgent string
    •  countryCode string
    •  languageCode string
    •  searchWord string
    •  duration int

•  Rankings table
  –  12 million rows
  –  Table Definitions
    •  pageURL string
    •  pageRank int
    •  avgDuration int


Modified Query
SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings_t R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits_t UV
  WHERE
    (datediff(UV.visitDate, '1999-01-01')>=0
    AND
    datediff(UV.visitDate, '2000-01-01')<=0)
  ) NUV
ON
  (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
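
To make the query's semantics concrete, here is a small pure-Python sketch of the same steps (date filter, join, group, sort, limit) on a few made-up rows. The rows are invented for illustration and are not benchmark data; the sketch also assumes each pageURL appears only once in the rankings table.

```python
from collections import defaultdict
from datetime import date

# Toy rows (made up for illustration, not benchmark data).
rankings = [  # (pageURL, pageRank)
    ("a.example", 10),
    ("b.example", 30),
]
uservisits = [  # (sourceIP, destURL, visitDate, adRevenue)
    ("1.1.1.1", "a.example", "1999-03-01", 1.0),
    ("1.1.1.1", "b.example", "1999-06-01", 2.0),
    ("2.2.2.2", "a.example", "1999-09-01", 2.5),
    ("2.2.2.2", "b.example", "2001-01-01", 9.0),  # outside the date window
]

LOW, HIGH = date(1999, 1, 1), date(2000, 1, 1)
page_rank = dict(rankings)  # assumes one row per pageURL

# Inner SELECT with the WHERE clause, then the JOIN on pageURL = destURL.
groups = defaultdict(lambda: {"revenue": 0.0, "ranks": []})
for source_ip, dest_url, visit_date, ad_revenue in uservisits:
    if not (LOW <= date.fromisoformat(visit_date) <= HIGH):
        continue  # both datediff conditions must hold
    if dest_url not in page_rank:
        continue  # inner join drops visits without a matching ranking
    groups[source_ip]["revenue"] += ad_revenue
    groups[source_ip]["ranks"].append(page_rank[dest_url])

# GROUP BY sourceIP, ORDER BY totalRevenue DESC, LIMIT 1.
rows = [(ip, g["revenue"], sum(g["ranks"]) / len(g["ranks"]))
        for ip, g in groups.items()]
top = max(rows, key=lambda r: r[1])
print(top)
```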


Benchmark Result (Hive)

Avg. Job Latency [sec]

Storage format   Compression   Latency
RCFile           Snappy        197.894
RCFile           Gzip          234.289
SequenceFile     Snappy        213.616
SequenceFile     Gzip          227.883
TextFile         No Comp.      235.843


Benchmark Result (Impala)

Avg. Job Latency [sec]

Storage format   Compression   Latency
RCFile           Snappy        16.059
RCFile           Gzip          17.030
SequenceFile     Snappy        17.725
SequenceFile     Gzip          21.250
TextFile         No Comp.      32.776


Block Location Cache effect?

Impala job latency per run [sec]

Job       TextFile   SequenceFile  SequenceFile  RCFile   RCFile
          No Comp.   Gzip          Snappy        Gzip     Snappy
1st       50.256     23.692        22.085        18.475   20.042
2nd       34.905     20.710        19.733        16.690   18.859
3rd       30.752     20.604        15.608        16.620   16.642
4th       26.848     20.625        15.602        16.617   12.148
5th       21.121     20.620        15.597        16.747   12.606
Average   32.776     21.250        17.725        17.030   16.059

v  1st job is the slowest, and the fastest job is one of the others
due to Block Location Cache eﬀect?
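
The per-run numbers can be checked with a short script. The sketch below recomputes each average and compares the first run with the fastest run; the column labels are an assumption based on reading the table header in order (TextFile, then SequenceFile, then RCFile).

```python
from statistics import mean

# Per-run Impala latencies [sec], copied from the table above.
# Column labels are assumed from the header order.
runs = {
    "TextFile / No Comp.":   [50.256, 34.905, 30.752, 26.848, 21.121],
    "SequenceFile / Gzip":   [23.692, 20.710, 20.604, 20.625, 20.620],
    "SequenceFile / Snappy": [22.085, 19.733, 15.608, 15.602, 15.597],
    "RCFile / Gzip":         [18.475, 16.690, 16.620, 16.617, 16.747],
    "RCFile / Snappy":       [20.042, 18.859, 16.642, 12.148, 12.606],
}

summary = {}
for label, latencies in runs.items():
    summary[label] = {
        "average": mean(latencies),
        "first_is_slowest": latencies[0] == max(latencies),
        # How much slower the first run is than the fastest run.
        "first_over_fastest": latencies[0] / min(latencies),
    }

for label, s in summary.items():
    print(f"{label:24s} avg={s['average']:.3f} "
          f"first_is_slowest={s['first_is_slowest']} "
          f"first/fastest={s['first_over_fastest']:.2f}")
```

Because the average includes the slow first run, the steady-state latency of later runs is lower than the reported average in every column.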


Conclusion

v Impala is over 10 times faster than MRv1 +
Hive
v Speciﬁcally,
l  Impala 0.6 beta
•  RCFile compressed as Snappy: 16.059 sec
l  MRv1 + Hive 0.10
•  RCFile compressed as Snappy: 197.894 sec
v Hope that impala GA included in CDH5
makes faster
l  Support Trevni columner format
l  Optimized Query Planner

Thanks.
