SQL on Hadoop benchmarks using TPC-DS query set

Sharon Kirkham, VP Analytics & Consulting, Kognitio
March 2017

Big data analytics for the business
• Where analytics was once the domain of data analysts and data scientists, highly visual analytics tools such as Tableau and MicroStrategy are making it pervasive across the organization.
• But these BI tools generate SQL (Structured Query Language) queries to access data.
- They require SQL access to be very fast…
- …and wider use of data analytics within organizations requires support for high-concurrency access.

Enterprise level SQL on data in Hadoop
Many organizations have issues using the tools in standard Hadoop distributions to support enterprise level SQL on data in Hadoop, caused by:
• SQL maturity: Some products cannot handle all the SQL generated by developers and/or third party tools. They either do not support the SQL, or produce very poor query plans.
• Query performance: Queries that are supported perform poorly even under single user workload.
• Concurrency: Products cannot handle concurrency well in terms of performance and give errors when under load.

How do SQL on Hadoop engines perform?

Testing different SQL on Hadoop engines
• We tested:
- Impala (version 2.6.0)
- Kognitio (version 8.1.50)
- Spark (version 2.0 beta)
• All were run on the same 12 node infrastructure running Cloudera CDH 5.8.2.
- Each product was given all available resources for the benchmark.
(Note: Standard Hive was originally investigated as part of this benchmark, but its lack of SQL support and poor single thread performance meant it was removed from the benchmarks. In future testing, Hive with LLAP will be included.)

Impala schematic
• Impala version 2.6.0 was used. This is the version shipped by Cloudera in CDH 5.8.2.
• Impala was run outside of the YARN resource manager on Hadoop.
• Data was held in Hive parquet formatted files. The largest tables were partitioned on the columns most commonly used in joins.
• Statistics were gathered for both Hive and Impala.
• Queries were submitted from the edge node using the impala-shell command line tool for each of the randomised query streams in the benchmark.
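As a rough illustration of submitting one query stream through impala-shell, the sketch below builds (but does not run) one command line per query. The host name, database name and file layout are hypothetical and not taken from the benchmark; only the impala-shell options -i (impalad address), -d (database) and -f (query file) are relied on.

```python
import shlex

def impala_shell_command(query_file, impalad="impalad-host:21000", database="tpcds_1tb"):
    """Build an impala-shell argument list for one query file.

    The default host and database names are placeholders (assumptions).
    """
    return ["impala-shell", "-i", impalad, "-d", database, "-f", query_file]

def stream_commands(stream, query_dir="queries"):
    """Map a stream (a list of TPC-DS query numbers) to one command per query.

    The queries/queryNN.sql naming scheme is hypothetical.
    """
    return [impala_shell_command(f"{query_dir}/query{n:02d}.sql") for n in stream]

if __name__ == "__main__":
    # Print the commands for a short example stream instead of executing them.
    for cmd in stream_commands([3, 52, 7]):
        print(shlex.join(cmd))
```

Each argument list could be handed to subprocess.run and timed; that step is left out here because it needs a live cluster.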

Kognitio schematic
• Kognitio version 8.1.50-rel20170105 was used.
• Kognitio was deployed within the YARN resource manager.
• Data was held in Kognitio RAM view images. The larger data sets were hashed on the columns most commonly used in the joins. These images reside within the Kognitio YARN containers and can be utilised by multiple queries.
• Kognitio statistics were collected on all views.
• Queries were submitted from the edge node using the Kognitio command line tool wxsubmit for each of the randomised query streams in the benchmark.
• Each query is executed within all containers, using the remaining RAM available (not utilised by view images).

SparkSQL schematic
• Spark version 2.0 was used. This version was in beta test on Cloudera and is available here: https://blog.cloudera.com/blog/2016/09/apache-spark-2-0-beta-now-available-for-cdh/
• Spark was deployed within the YARN resource manager on Hadoop.
• Data was held in Hive parquet formatted files. The largest tables were partitioned on the columns most commonly used in joins.
• Statistics for Hive were gathered.
• Queries were submitted from the edge node using the sparksql command line tool for each query in the randomised query streams of the benchmark.
• When a query is submitted to Spark, it requests containers from YARN to satisfy that query, depending on the resources it estimates that it requires. This can lead to many containers on each node for each query. The vast majority of the errors encountered in the Spark benchmark occurred when it could not obtain the resources it requested from YARN.

Using the TPC-DS benchmark
The TPC-DS benchmark is a well-respected, widely used query set that is representative
of the type of queries that seem to be most problematic. The TPC framework is also
designed for benchmarking concurrent workloads.
The benchmark can be interpreted as posing 3 distinct performance questions:
• Can the platform run the queries? = functional testing over 1GB data
• Can the platform perform at scale? = single stream over 1TB data
• How does the platform perform under load? = concurrent multiple streams over 1TB
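The randomised query streams used for the single stream and concurrent tests can be sketched as follows. This is a minimal stand-in that uses a seeded shuffle of the 99 query numbers; the official TPC-DS kit generates its stream orderings with its own tooling, so the seed and ordering here are assumptions for illustration only.

```python
import random

TPCDS_QUERY_COUNT = 99  # the TPC-DS query set contains 99 queries

def build_streams(n_streams, seed=2017):
    """Return n_streams lists, each a different ordering of queries 1..99.

    A fixed seed (an arbitrary choice) keeps the orderings reproducible,
    so every platform under test can be given identical streams.
    """
    rng = random.Random(seed)
    streams = []
    for _ in range(n_streams):
        order = list(range(1, TPCDS_QUERY_COUNT + 1))
        rng.shuffle(order)
        streams.append(order)
    return streams

if __name__ == "__main__":
    streams = build_streams(10)
    print(len(streams), "streams; first five queries of stream 0:", streams[0][:5])
```

Using one stream corresponds to the "perform at scale" question, and several streams run together corresponds to the "perform under load" question.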

Can the platform run the queries?

Breadth of SQL supported by each platform
Platform                                                                         Impala  Kognitio  SparkSQL
Runs “Out of box” (no changes needed)                                            55      76        72
Minor syntax changes – such as removing reserved words or ‘grammatical’ changes  18      23        27
Long running – SQL compiles but query doesn’t come back within 1 hour            2       -         -
No support                                                                       24      -         -
The table above shows that for functional testing (over 1GB of data) both Kognitio and Spark can execute all 99 TPC-DS queries. This is a big improvement for Spark over version 1.6, which could only execute 51 of the 99 queries. Impala has some way to go with SQL support: OLAP grouping sets, some sub-query functionality and set functions are still lacking.
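The counts in the functional test can be cross-checked with a few lines of arithmetic. The sketch below reads the table's blank cells as zero and attributes the long running and unsupported counts to Impala; that attribution is an assumption, chosen because it is the reading under which every platform's column totals 99.

```python
# Functional test results over 1GB, one dict per platform.
# Blank cells in the table are entered as 0 (assumption).
RESULTS_1GB = {
    "Impala":   {"out_of_box": 55, "minor_changes": 18, "long_running": 2, "no_support": 24},
    "Kognitio": {"out_of_box": 76, "minor_changes": 23, "long_running": 0, "no_support": 0},
    "SparkSQL": {"out_of_box": 72, "minor_changes": 27, "long_running": 0, "no_support": 0},
}

def runnable(counts):
    """Queries that execute, either unchanged or after minor syntax changes."""
    return counts["out_of_box"] + counts["minor_changes"]

if __name__ == "__main__":
    for platform, counts in RESULTS_1GB.items():
        total = sum(counts.values())
        print(f"{platform}: {runnable(counts)} of {total} queries run")
```

Under this reading Impala runs 73 of the 99 queries, while Kognitio and SparkSQL run all 99, which matches the statement that both can execute the full query set.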

Can the platform perform at scale?

Query overview – single stream 1TB
Platform             Impala  Kognitio  SparkSQL
Queries run          73      99        89
Long running         2       -         10
No support           24      -         -
Fastest query count  6       92        1

Single query stream at 1TB scale
[Figure: three charts comparing per-query speed for a single stream at 1TB: Kognitio vs Impala, Kognitio vs Spark, and Impala vs Spark. Each chart plots the number of queries (0 to 95) against relative speed (equal speed, 2x, 5x and 10x in either direction), with regions marked where one platform is faster and where the other platform is long running or can't run.]
Each plot represents the relative speed between two of the platforms; Kognitio (blue), Impala (green) and Spark (orange).
Each query is represented by a horizontal block. The faster platform for a given query gets the largest proportion of the block.
Therefore the more a colour dominates the better that platform performs.
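One way to read the relative speed plots is as a per-query comparison of two runtimes. The sketch below is not the method used to draw the charts; it assumes runtimes in seconds (positive numbers), uses None for a query that was long running or could not be run, and picks bucket boundaries at 2x, 5x and 10x to mirror the chart's axis labels.

```python
def relative_speed(name_a, secs_a, name_b, secs_b):
    """Return (faster_platform_or_None, bucket) for one query.

    secs_* are positive runtimes in seconds, or None when the platform
    did not finish the query. Bucket boundaries are assumptions.
    """
    if secs_a is None and secs_b is None:
        return None, "not included"
    if secs_b is None:
        return name_a, "other platform did not finish"
    if secs_a is None:
        return name_b, "other platform did not finish"
    if secs_a == secs_b:
        return None, "equal speed"

    faster = name_a if secs_a < secs_b else name_b
    ratio = max(secs_a, secs_b) / min(secs_a, secs_b)
    if ratio < 2:
        bucket = "under 2x"
    elif ratio < 5:
        bucket = "2x to 5x"
    elif ratio < 10:
        bucket = "5x to 10x"
    else:
        bucket = "10x or more"
    return faster, bucket

if __name__ == "__main__":
    # Made-up runtimes, purely to show the call shape.
    print(relative_speed("Kognitio", 4.0, "Impala", 30.0))
```

Counting how many queries fall into each bucket for each pair of platforms would give the kind of breakdown the plots summarise.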

How does the platform perform under load?

1TB benchmark run under increasing workloads up to 10 query streams
Platform                    Impala  Kognitio  SparkSQL
Queries run in each stream  68      92        79
Long running                7       7         20
No support                  24      -         -
Fastest query count         12      80        0
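A concurrent run of several query streams can be sketched with a thread pool, one worker per stream. This is an illustrative harness, not the one used for the benchmark: run_query is a hypothetical callable supplied by the caller (for example, a wrapper around a command line tool), and a query counts as "run in each stream" only if it completed in every stream.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_stream(stream_id, queries, run_query):
    """Run one stream's queries in order; record seconds, or None on failure."""
    timings = {}
    for q in queries:
        start = time.perf_counter()
        try:
            run_query(stream_id, q)  # hypothetical callable; raises on error
            timings[q] = time.perf_counter() - start
        except Exception:
            timings[q] = None  # e.g. an error under load
    return timings

def run_concurrently(streams, run_query):
    """Run all streams at the same time; assumes at least one stream."""
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [pool.submit(run_stream, i, qs, run_query)
                   for i, qs in enumerate(streams)]
        return [f.result() for f in futures]

def queries_run_in_every_stream(results):
    """Query numbers that completed successfully in all streams."""
    completed = [{q for q, t in timings.items() if t is not None}
                 for timings in results]
    return set.intersection(*completed) if completed else set()
```

Threads are sufficient here because each worker would spend its time waiting on the database engine rather than doing work in Python.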

10 query streams at 1TB
[Figure: three charts comparing per-query speed with 10 concurrent query streams at 1TB: Kognitio vs Impala, Kognitio vs Spark, and Impala vs Spark. Each chart plots the number of queries (0 to 95) against relative speed (equal speed, 2x, 5x and 10x in either direction), with regions marked where one platform is faster and where queries are not included because they are long running or can't run.]

Read full whitepaper at:
kognitio.com/news-tpc-ds-benchmarks/