Evaluation of Cloudera Impala 1.1

Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/
Aug 7, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

v  Sentry support:
l  Fine-‐‑‒grained authorization
l  Role-‐‑‒based authorization
v  Support for views
v  Performance improvements
l  Parquet columnar performance
l  More eﬃcient metadata refresh for larger installations
v  Additional SQL
l  SQL-‐‑‒89 joins (in addition to existing SQL-‐‑‒92)
l  LOAD function
l  REFRESH command for JDBC/ODBC
v  Improved Hbase support:
l  Binary types
l  Caching conﬁguration
v  Fixed many bugs
Cloudera Impala 1.1 was released !!
2

v Hive ⇒ Impala
l On Impala shell, can read data in “VIEW” that was
created via Hive command ?
v Impala ⇒ Hive
l On Hive shell, can read data in “VIEW” that was
created via Impala command ?
v Result
Two “VIEW”s have compatibility
Check compatibility of “VIEW”
3

Check performance (Hive on Cluster1)
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      222.039
SequenceFile     Gzip          244.67
SequenceFile     Snappy        239.182
RCFile           Gzip          228.801
RCFile           Snappy        230.327
Note: these results are not valid as a performance evaluation because some data may have been read remotely.
See the slide “Check performance (Hive on Cluster2)”.

Check performance (Impala on Cluster1)
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      23.518
SequenceFile     Gzip          32.155
SequenceFile     Snappy        28.617
RCFile           Gzip          20.774
RCFile           Snappy        12.654
ParquetFile      Snappy        13.146
Note: these results are not valid as a performance evaluation because some data may have been read remotely.
See the slide “Check performance (Impala on Cluster2)”.

Check performance (Hive on Cluster2)
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      272.176
SequenceFile     Gzip          249.531
SequenceFile     Snappy        245.009
RCFile           Gzip          230.034
RCFile           Snappy        216.802

Check performance (Impala on Cluster2)
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      32.528
SequenceFile     Gzip          28.73
SequenceFile     Snappy        21.173
RCFile           Gzip          24.794
RCFile           Snappy        14.308
ParquetFile      Snappy        19.814
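
To put the two Cluster2 slides side by side, the following minimal sketch divides each Hive latency by the matching Impala latency. It assumes the extracted chart values are listed in the same order as the chart labels (TextFile, SequenceFile Gzip/Snappy, RCFile Gzip/Snappy); the dictionary and function names are illustrative only. Parquet is left out because the Hive slide reports no Parquet figure.

```python
# Average job latencies in seconds on Cluster2, keyed by (storage format, compression).
# Assumption: values appear in the same order as the chart labels.
hive_cluster2 = {
    ("TextFile", "None"): 272.176,
    ("SequenceFile", "Gzip"): 249.531,
    ("SequenceFile", "Snappy"): 245.009,
    ("RCFile", "Gzip"): 230.034,
    ("RCFile", "Snappy"): 216.802,
}
impala_cluster2 = {
    ("TextFile", "None"): 32.528,
    ("SequenceFile", "Gzip"): 28.73,
    ("SequenceFile", "Snappy"): 21.173,
    ("RCFile", "Gzip"): 24.794,
    ("RCFile", "Snappy"): 14.308,
}


def speedups(hive, impala):
    """Return Hive latency divided by Impala latency for every shared key."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


ratios = speedups(hive_cluster2, impala_cluster2)

# Print the combinations from the largest speedup to the smallest.
for (fmt, comp), ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:<13}{comp:<8}{ratio:5.1f}x")
```

Under that ordering assumption, Impala comes out roughly 8 to 15 times faster than Hive on Cluster2, with the largest ratio for RCFile with Snappy.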

v IMPALA-‐‑‒357
l Insert into Parquet exceed mem-‐‑‒limit
v Problem
l Even if set mem_̲limit setting, when create ParquetFile
table with partitions, consumed memory isnʼ’t limited.
l At last, Impalad crashes due to memory shortage
v Result
CREATE command failed due to memory limit
Check ﬁxed bug
8

v Thanks to dev. team, Impala is also going
from “Good to Great”
v Both “VIEW” and “Parquet” are already ready
v Performance
v RCFile+Snappy is the fastest on both Cluster1 and
Cluster2
v If use larger size table, Parquet+Snappy may be the
fastest
v Hope for future extension
l Support Structure Types
l Support UDF/UDTF, etc
Summary
9
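
The performance claim in the summary can be checked against the Impala slides with a small sketch like the one below. It assumes the extracted chart values map to the chart labels in listing order; the names `impala_latency` and `fastest` are illustrative.

```python
# Impala average job latencies in seconds, keyed by (storage format, compression).
# Assumption: values appear in the same order as the chart labels on each slide.
impala_latency = {
    "Cluster1": {
        ("TextFile", "None"): 23.518,
        ("SequenceFile", "Gzip"): 32.155,
        ("SequenceFile", "Snappy"): 28.617,
        ("RCFile", "Gzip"): 20.774,
        ("RCFile", "Snappy"): 12.654,
        ("ParquetFile", "Snappy"): 13.146,
    },
    "Cluster2": {
        ("TextFile", "None"): 32.528,
        ("SequenceFile", "Gzip"): 28.73,
        ("SequenceFile", "Snappy"): 21.173,
        ("RCFile", "Gzip"): 24.794,
        ("RCFile", "Snappy"): 14.308,
        ("ParquetFile", "Snappy"): 19.814,
    },
}


def fastest(latencies):
    """Return the (format, compression) pair with the lowest latency."""
    return min(latencies, key=latencies.get)


for cluster, latencies in impala_latency.items():
    fmt, comp = fastest(latencies)
    print(f"{cluster}: {fmt}+{comp} at {latencies[(fmt, comp)]} sec")
```

With this mapping, RCFile+Snappy has the lowest latency on both clusters, and Parquet+Snappy trails it by about 0.5 sec on Cluster1 and about 5.5 sec on Cluster2.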

Appendix. Benchmark Details

Our System Environment(Cluster1)
•  Installed using Cloudera Manager Free Edition 4.6.0
•  All servers are connected with 1Gbps Ethernet through an L2 switch
[Figure: master/slave cluster diagram, labeled “14 Servers” and “3 Servers”. Node roles shown: one node with Active NameNode, DataNode, TaskTracker and Impalad; Stand-by NameNode, JobTracker and statestored; nine nodes with DataNode, TaskTracker and Impalad; four nodes with DataNode only.]

Our System Environment(Cluster2)
•  Installed using Cloudera Manager Free Edition 4.6.0
•  All servers are connected with 1Gbps Ethernet through an L2 switch
[Figure: master/slave cluster diagram, labeled “10 Servers” and “3 Servers”. Node roles shown: one node with Active NameNode, DataNode, TaskTracker and Impalad; Stand-by NameNode, JobTracker and statestored; nine nodes with DataNode, TaskTracker and Impalad; four nodes with DataNode only, shown with the label “Decommissioned”.]

v CPU
l Intel Core 2 Duo 2.13 GHz with Hyper Threading
v Memory
l 8GB : Namenodes only
l 4GB : Others
v Disk
l 7,200 rpm SATA mechanical Hard Disk Drive * 1
v OS
l Cent OS 6.3
Our Server Speciﬁcation
13

v  Use CDH4.3.0 + Impala 1.1
v  Use hivebench in open-‐‑‒sourced benchmark tool “HiBench”
l  https://github.com/hibench
v  Modified datasets to 1/10 scale
l  Default configuration generates table with 1 billion rows
v  Modified query sentence
l  Deleted “INSERT INTO TABLE …” to evaluate read-‐‑‒only performance
v  Combines a few storage format with a few compression method
l  TextFile, SequenceFile, RCFile, ParquestFile
l  No compression, Gzip, Snappy
v  Comparison with job query latency
v  Average job latency over 5 measurements
v  Benchmark on both Cluster1 and Cluster2
Benchmark
14
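
The slides do not show how the five measurements were collected. As a minimal sketch of the averaging step only, the helper below times any callable a fixed number of times and returns the mean wall-clock latency. The name `average_latency` and the `run_job` callable are hypothetical; in practice `run_job` would wrap whatever command submits the query to Hive or Impala.

```python
import statistics
import time
from typing import Callable


def average_latency(run_job: Callable[[], None], repeats: int = 5) -> float:
    """Run the job `repeats` times and return the mean latency in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_job()  # hypothetical: submit the query and block until it finishes
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


if __name__ == "__main__":
    # Stand-in job so the sketch runs without a cluster.
    mean_sec = average_latency(lambda: time.sleep(0.01))
    print(f"Avg. job latency: {mean_sec:.3f} sec")
```

Measuring wall-clock time around the whole job matches the “job latency” wording on the slide, but whether the original figures include client start-up time is not stated.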

Modified Datasets
•  Uservisits table
–  100 million rows
–  16,895 MB as TextFile
–  Table Definitions
    •  sourceIP string
    •  destURL string
    •  visitDate string
    •  adRevenue double
    •  userAgent string
    •  countryCode string
    •  languageCode string
    •  searchWord string
    •  duration int
•  Rankings table
–  12 million rows
–  744 MB as TextFile
–  Table Definitions
    •  pageURL string
    •  pageRank int
    •  avgDuration int

Modified Query
-- Return the source IP with the highest total ad revenue.
SELECT
  sourceIP,
  sum(adRevenue) AS totalRevenue,
  avg(pageRank)
FROM
  rankings_t R
JOIN [BROADCAST] (
  -- Keep only visits dated between 1999-01-01 and 2000-01-01.
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits_t UV
  WHERE
    (datediff(UV.visitDate, '1999-01-01') >= 0
    AND
    datediff(UV.visitDate, '2000-01-01') <= 0)
  ) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;

Thanks!
I would like to use a TPC benchmark in the next evaluation…
