Performance evaluation of cloudera impala 0.6 beta with comparison to Hive
- 1. Cloudera impala 0.6 beta
Performance Evaluation
(with Comparison to Hive)
Mar. 6, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon
Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/
- 2. Cloudera impala 0.6 beta
• Changes since 0.5 beta
• Cloudera Manager 4.5 and CDH 4.2 support Impala 0.6.
• Support for the RCFile file format.
• Added support for Impala on SUSE and Debian/Ubuntu.
– RHEL 5.7/6.2 and CentOS 5.7/6.2
– SUSE 11 with Service Pack 1 or later
– Ubuntu 10.04/12.04 and Debian 6.03
- 3. System Environment
• Installed via Cloudera Manager Free Edition 4.5.0
• Master: 3 servers
– Active NameNode
– Stand-by NameNode
– JobTracker and statestored
• Slave: 11 servers, each running
– DataNode
– TaskTracker
– Impalad
• All servers are connected with 1Gbps Ethernet through an L2 switch
- 4. Server Specification
• CPU
– Intel Core 2 Duo 2.13 GHz with Hyper-Threading
• Memory
– 4GB
• Disk
– 7,200 rpm SATA mechanical Hard Disk Drive
• OS
– CentOS 6.2
- 5. Benchmark
• Used CDH4.2.0 + Impala version 0.6 beta
• Used hivebench from the open-source benchmark tool “HiBench”
– https://github.com/hibench
• Reduced the datasets to 1/10 scale
– The default configuration generates a table with 1 billion rows
• Modified the query statement
– Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combined several Hive storage formats with several compression methods
– TextFile, SequenceFile, RCFile
– No compression, Gzip, Snappy
• Compared configurations by query job latency
– Average job latency over 5 measurements
- 6. Modified Datasets
• Uservisits table
– 100 million rows
– Table Definitions
• sourceIP string
• destURL string
• visitDate string
• adRevenue double
• userAgent string
• countryCode string
• languageCode string
• searchWord string
• duration int
• Rankings table
– 12 million rows
– Table Definitions
• pageURL string
• pageRank int
• avgDuration int
- 7. Modified Query
SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings_t R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits_t UV
  WHERE
    (datediff(UV.visitDate, '1999-01-01')>=0
    AND
    datediff(UV.visitDate, '2000-01-01')<=0)
) NUV
ON
  (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
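The two datediff conditions in the query keep visits dated from 1999-01-01 through 2000-01-01, both ends inclusive. A minimal Python sketch of that filter, assuming Hive's datediff(enddate, startdate) returns enddate minus startdate in days and that visitDate is a yyyy-MM-dd string. The helper names datediff and in_window are illustrative only and are not part of the benchmark:

```python
from datetime import date


def datediff(end: str, start: str) -> int:
    """Days from start to end for yyyy-MM-dd strings (assumed to mirror Hive's datediff)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def in_window(visit_date: str) -> bool:
    """True when visit_date passes both datediff conditions of the modified query."""
    return (datediff(visit_date, "1999-01-01") >= 0
            and datediff(visit_date, "2000-01-01") <= 0)


if __name__ == "__main__":
    for d in ("1998-12-31", "1999-01-01", "1999-07-15", "2000-01-01", "2000-01-02"):
        print(d, in_window(d))
```

Both boundary dates pass the filter, so the window covers one year plus one day.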
- 8. Benchmark Result (Hive)
Avg. Job Latency [sec]
• RCFile
– Snappy: 197.894
– Gzip: 234.289
• SequenceFile
– Snappy: 213.616
– Gzip: 227.883
• TextFile
– No Comp.: 235.843
- 9. Benchmark Result (impala)
Avg. Job Latency [sec]
• RCFile
– Snappy: 16.059
– Gzip: 17.030
• SequenceFile
– Snappy: 17.725
– Gzip: 21.250
• TextFile
– No Comp.: 32.776
- 10. Block Location Cache effect?
Impala job latency [sec] per run

Job       TextFile   SequenceFile        RCFile
          No Comp.   Gzip      Snappy    Gzip      Snappy
1st       50.256     23.692    22.085    18.475    20.042
2nd       34.905     20.710    19.733    16.690    18.859
3rd       30.752     20.604    15.608    16.620    16.642
4th       26.848     20.625    15.602    16.617    12.148
5th       21.121     20.620    15.597    16.747    12.606
Average   32.776     21.250    17.725    17.030    16.059

• The 1st job is the slowest in every configuration, and the fastest job is always one of the later runs. Is this due to the Block Location Cache effect?
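The averages and the first-run penalty can be rechecked from the per-run figures. The following is a sketch, not the author's tooling: the latencies are copied from the slide, the pairing of format and compression labels with columns is read from the flattened header and should be treated as an assumption, and the names LATENCIES and summarize are hypothetical.

```python
from statistics import mean

# Impala job latencies in seconds for runs 1st..5th, copied from the slide.
# The "format / compression" labels are an assumed reading of the table header.
LATENCIES = {
    "TextFile / No Comp.": [50.256, 34.905, 30.752, 26.848, 21.121],
    "SequenceFile / Gzip": [23.692, 20.710, 20.604, 20.625, 20.620],
    "SequenceFile / Snappy": [22.085, 19.733, 15.608, 15.602, 15.597],
    "RCFile / Gzip": [18.475, 16.690, 16.620, 16.617, 16.747],
    "RCFile / Snappy": [20.042, 18.859, 16.642, 12.148, 12.606],
}


def summarize(runs):
    """Return the average, the first-run latency, and the fastest later run."""
    first, later = runs[0], runs[1:]
    return {
        "average": round(mean(runs), 3),
        "first": first,
        "fastest_later": min(later),
        "first_is_slowest": first == max(runs),
    }


if __name__ == "__main__":
    for name, runs in LATENCIES.items():
        print(name, summarize(runs))
```

Comparing "first" with "fastest_later" per column shows how much of each average is due to the cold first run. It does not by itself confirm that the Block Location Cache is the cause.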
- 11. Conclusion
• Impala is over 10 times faster than MRv1 + Hive (about 12 times in the fastest configuration)
• Specifically,
– Impala 0.6 beta
• RCFile compressed with Snappy: 16.059 sec
– MRv1 + Hive 0.10
• RCFile compressed with Snappy: 197.894 sec
• Hope that the Impala GA release included in CDH5 will be even faster
– Support for the Trevni columnar format
– Optimized Query Planner
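The speedup behind the conclusion can be recomputed from the two result slides. A minimal sketch, assuming the average latencies are paired by storage format and compression as read from the charts. The dictionary names HIVE_SEC and IMPALA_SEC and the function speedups are illustrative, not from the original deck:

```python
# Average job latency in seconds, copied from the Hive and Impala result slides.
# The format/compression pairing is an assumed reading of the bar charts.
HIVE_SEC = {
    "RCFile / Snappy": 197.894,
    "RCFile / Gzip": 234.289,
    "SequenceFile / Snappy": 213.616,
    "SequenceFile / Gzip": 227.883,
    "TextFile / No Comp.": 235.843,
}
IMPALA_SEC = {
    "RCFile / Snappy": 16.059,
    "RCFile / Gzip": 17.030,
    "SequenceFile / Snappy": 17.725,
    "SequenceFile / Gzip": 21.250,
    "TextFile / No Comp.": 32.776,
}


def speedups(hive, impala):
    """Return Hive latency divided by Impala latency for each configuration."""
    return {name: hive[name] / impala[name] for name in hive}


if __name__ == "__main__":
    for name, ratio in speedups(HIVE_SEC, IMPALA_SEC).items():
        print(f"{name}: {ratio:.2f}x")
```

With these figures the Snappy-compressed RCFile ratio comes to about 12.3, while uncompressed TextFile comes to about 7.2, so the "over 10 times" claim holds for the compressed configurations.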
- 12. Thanks.