SQL on Hadoop benchmarks using the TPC-DS query set
Sharon Kirkham, VP Analytics & Consulting, Kognitio
March 2017
Big data analytics for the business
• Where analytics was once the domain of data analysts or scientists, highly visual data analytics tools such as Tableau or MicroStrategy are allowing analytics to become pervasive across the organization.
• But these BI tools generate SQL (Structured Query Language) queries to access data.
- They require the SQL access to be very fast…
- …and wider usage of data analytics within organizations requires support for high-concurrency access.
Enterprise level SQL on data in Hadoop
Many organizations have issues using the tools in standard Hadoop distributions to support enterprise-level SQL on data in Hadoop, caused by:
SQL maturity
Some products cannot handle all the SQL generated by developers and/or third-party tools. They either do not support the SQL or produce very poor query plans.
Query performance
Queries that are supported perform poorly, even under a single-user workload.
Concurrency
Products cannot handle concurrency well in terms of performance and give errors when under load.
How do SQL on Hadoop
engines perform?
Testing different SQL on Hadoop
• We tested:
- Impala (version 2.6.0)
- Kognitio (version 8.1.50)
- Spark (version 2.0 beta)
• All three ran on the same 12-node infrastructure running Cloudera CDH 5.8.2.
- Each product was given all available resources for the benchmark.
(Note: Standard Hive was originally investigated as part of this benchmark, but lack of SQL support and poor single-thread performance meant it was removed from the benchmarks. In future testing, Hive with LLAP will be included.)
Impala
schematic
• Impala version 2.6.0 was used. This is the version shipped by Cloudera in CDH 5.8.2.
• Impala was run outside of the YARN resource manager on Hadoop.
• Data was held in Hive Parquet-formatted files. The largest tables were partitioned on the columns most commonly used in joins.
• Statistics were gathered for both Hive and Impala.
• Queries were submitted from the edge node using the impala-shell command line tool for each of the randomised query streams in the benchmark.
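The submission step described in these setup slides, sending each query of a stream from the edge node through a command-line client, could be driven by a small harness along the following lines. This is only a sketch, not the harness used for the benchmark: the function names, the thread-per-stream design and the example command in the comment (host name and file name) are assumptions. The one-hour cut-off mirrors the deck's own definition of a "long running" query.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# The deck counts a query as "long running" if it has not returned within 1 hour.
LONG_RUNNING_S = 3600


def run_query(cmd, timeout_s=LONG_RUNNING_S):
    """Run one client command and classify it as ok, error or long_running.

    `cmd` is an argument list for whichever client is in use, for example
    ["impala-shell", "-i", "impalad-host", "-f", "query01.sql"]
    (hypothetical host and file names).
    """
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "long_running"
    return "ok" if done.returncode == 0 else "error"


def run_stream(commands, timeout_s=LONG_RUNNING_S):
    """Run the queries of one stream one after another."""
    return [run_query(cmd, timeout_s) for cmd in commands]


def run_streams(streams, timeout_s=LONG_RUNNING_S):
    """Run several streams at the same time, one worker thread per stream."""
    workers = max(1, len(streams))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: run_stream(s, timeout_s), streams))
```

Each stream stays sequential while separate streams overlap, which is one simple way to model the concurrent workload; a real harness would also record timings and client output.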
Kognitio
schematic
• Kognitio version 8.1.50-rel20170105 was used.
• Kognitio was deployed within the YARN resource manager.
• Data was held in Kognitio RAM view images. The larger data sets were hashed on the columns most commonly used in the joins. These view images reside within the Kognitio YARN containers and can be utilised by multiple queries.
• Kognitio statistics were collected on all views.
• Queries were submitted from the edge node using the Kognitio command line tool wxsubmit for each of the randomised query streams in the benchmark.
• Each query is executed within all containers, in the remaining RAM available (not utilised by view images).
SparkSQL
schematic
• Spark version 2.0 was used. This version was in beta test on Cloudera and is available here: https://blog.cloudera.com/blog/2016/09/apache-spark-2-0-beta-now-available-for-cdh/
• Spark was deployed within the YARN resource manager on Hadoop.
• Data was held in Hive Parquet-formatted files. The largest tables were partitioned on the columns most commonly used in joins.
• Statistics for Hive were gathered.
• Queries were submitted from the edge node using the sparksql command line tool for each query in the randomised query streams of the benchmark.
• When a query is submitted to Spark, it requests containers from YARN to satisfy that query, depending on the resources it estimates that it requires. This can lead to many containers on each node for each query. The vast majority of the errors encountered in the Spark benchmark occurred when Spark could not obtain the resources it requested from YARN.
Using the TPC-DS benchmark
The TPC-DS benchmark is a well-respected, widely used query set that is representative
of the type of queries that seem to be most problematic. The TPC framework is also
designed for benchmarking concurrent workloads.
The benchmark can be interpreted as posing 3 distinct performance questions:
• Can the platform run the queries? = functional testing over 1GB data
• Can the platform perform at scale? = single stream over 1TB data
• How does the platform perform under load? = concurrent multiple streams over 1TB
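The three questions map onto three test phases, each run over randomised query streams. A minimal sketch of that structure follows. The names `Phase`, `PHASES` and `make_streams` are hypothetical, and the seeded shuffle is only a stand-in: the official TPC-DS toolkit generates its own stream orderings, which this code does not reproduce. The stream count of 10 for the load phase is the maximum the deck reports.

```python
import random
from dataclasses import dataclass

TOTAL_QUERIES = 99  # size of the TPC-DS query set used in the deck


@dataclass(frozen=True)
class Phase:
    question: str
    scale: str
    streams: int


# The three phases as described on the slide.
PHASES = [
    Phase("Can the platform run the queries?", "1GB", 1),
    Phase("Can the platform perform at scale?", "1TB", 1),
    Phase("How does the platform perform under load?", "1TB", 10),
]


def make_streams(n_streams, n_queries=TOTAL_QUERIES, seed=0):
    """Return n_streams orderings of query numbers 1..n_queries.

    Assumption: a seeded shuffle stands in for the toolkit's stream generator.
    """
    rng = random.Random(seed)
    streams = []
    for _ in range(n_streams):
        order = list(range(1, n_queries + 1))
        rng.shuffle(order)
        streams.append(order)
    return streams
```

Seeding the generator keeps the stream orderings repeatable, so every platform can be given the same sequence of queries.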
Can the platform
run the queries?
Breadth of SQL supported by each
platform
Platform                                                                        Impala  Kognitio  SparkSQL
Runs "out of box" (no changes needed)                                               55        76        72
Minor syntax changes (such as removing reserved words or 'grammatical' changes)     18        23        27
Long running (SQL compiles but query doesn't come back within 1 hour)                2         -         -
No support                                                                          24         -         -
The table above shows that for functional testing (over 1GB of data) both Kognitio and Spark can execute all 99 TPC-DS queries. This is a big improvement for Spark over version 1.6, which could execute only 51 of the 99 queries. Impala has some way to go with SQL support: OLAP grouping sets, some sub-query functionality and set functions are still lacking.
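As a quick arithmetic check on the table, each platform's column should account for all 99 queries. The sketch below transcribes the counts and verifies this; treating blank cells as 0 is an assumption, and the names are illustrative only.

```python
TOTAL_QUERIES = 99

# Counts transcribed from the table; blank cells are assumed to mean 0.
FUNCTIONAL_1GB = {
    "Impala":   {"out_of_box": 55, "minor_changes": 18, "long_running": 2, "no_support": 24},
    "Kognitio": {"out_of_box": 76, "minor_changes": 23, "long_running": 0, "no_support": 0},
    "SparkSQL": {"out_of_box": 72, "minor_changes": 27, "long_running": 0, "no_support": 0},
}


def runnable(counts):
    """Queries that completed, with or without minor syntax changes."""
    return counts["out_of_box"] + counts["minor_changes"]


def check_totals(results, total=TOTAL_QUERIES):
    """Map each platform to True if its four categories sum to the total."""
    return {name: sum(counts.values()) == total for name, counts in results.items()}


for name, counts in FUNCTIONAL_1GB.items():
    print(f"{name}: {runnable(counts)} of {TOTAL_QUERIES} queries completed")
```

Under that assumption every column sums to 99, with 73 queries completing on Impala and 99 on each of Kognitio and SparkSQL.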
Can the platform
perform at scale?
Query overview – single stream 1TB
Platform              Impala  Kognitio  SparkSQL
Queries run               73        99        89
Long running               2         -        10
No support                24         -         -
Fastest query count        6        92         1
Single query stream at 1TB scale
[Figure: three relative-speed plots for the single query stream at 1TB.
Panels: Kognitio vs Impala, Kognitio vs Spark, Impala vs Spark.
Axis label: Number of Queries. Scale markers: equal speed, 2x, 5x, 10x.
Annotated regions mark the queries where one platform was faster and the queries where the other platform was long running or could not run. The Impala vs Spark panel also marks queries not included (can't run or long running).]
Each plot represents the relative speed between two of the platforms: Kognitio (blue), Impala (green) and Spark (orange).
Each query is represented by a horizontal block. The faster platform for a given query gets the larger proportion of the block.
Therefore, the more a colour dominates, the better that platform performs.
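One way to read such a plot is to turn each pair of run times into a share of the block and a speed-up band. The sketch below illustrates the idea only: the share formula and the band boundaries are assumptions, since the deck does not say how the blocks were computed, and all function names are hypothetical.

```python
# Band thresholds taken from the scale markers on the plots (2x, 5x, 10x).
BANDS = (10.0, 5.0, 2.0)


def block_shares(time_a, time_b):
    """Split one query's block so the faster platform gets the larger share.

    Assumption: each platform's share is proportional to the other's run time.
    """
    if time_a <= 0 or time_b <= 0:
        raise ValueError("run times must be positive")
    total = time_a + time_b
    return time_b / total, time_a / total


def speedup(time_a, time_b):
    """Return which platform was faster ("a", "b" or "equal") and by what factor."""
    if time_a <= 0 or time_b <= 0:
        raise ValueError("run times must be positive")
    if time_a == time_b:
        return "equal", 1.0
    faster = "a" if time_a < time_b else "b"
    return faster, max(time_a, time_b) / min(time_a, time_b)


def band(factor):
    """Name the highest scale marker that the speed-up factor reaches."""
    for threshold in BANDS:
        if factor >= threshold:
            return f"{threshold:g}x"
    return "under 2x"
```

For example, run times of 10 and 30 seconds would give platform A three quarters of the block and a 3x speed-up, which falls in the 2x band.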
How does the platform
perform under load?
1TB benchmark run under increasing
workloads up to 10 query streams
Platform                     Impala  Kognitio  SparkSQL
Queries run in each stream       68        92        79
Long running                      7         7        20
No support                       24         -         -
Fastest query count              12        80         0
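Setting the "queries run" rows of the single-stream and 10-stream tables side by side shows how many queries each platform stopped completing under load. The sketch below just does that subtraction on the figures reported in the deck; the names are illustrative.

```python
# "Queries run" rows transcribed from the single-stream and 10-stream tables.
SINGLE_STREAM = {"Impala": 73, "Kognitio": 99, "SparkSQL": 89}
TEN_STREAMS = {"Impala": 68, "Kognitio": 92, "SparkSQL": 79}


def load_drop(single, loaded):
    """Queries completed in a single stream but not in each of the concurrent streams."""
    return {platform: single[platform] - loaded[platform] for platform in single}


for platform, lost in load_drop(SINGLE_STREAM, TEN_STREAMS).items():
    print(f"{platform}: {lost} fewer queries completed at 10 streams")
```

On these figures the drop is 5 queries for Impala, 7 for Kognitio and 10 for SparkSQL.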
10 query streams at 1TB
[Figure: three relative-speed plots for 10 query streams at 1TB.
Panels: Kognitio vs Impala, Kognitio vs Spark, Impala vs Spark.
Axis label: Number of Queries. Scale markers: equal speed, 2x, 5x, 10x.
Annotated regions mark the queries where one platform was faster, the queries where the other platform was long running or could not run, and the queries not included (long running or can't run).]
Read full whitepaper at:
kognitio.com/news-tpc-ds-benchmarks/