AN EVALUATION OF TPC-H
ON SPARK & SPARK SQL IN ALOJA
M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
AGENDA
• Motivation & Research Objectives
• Spark
  • Ecosystem
  • Data Access
• ALOJA & TPC-H
• Spark SQL with or without Hive Metastore
• File Formats
• Correlation Analysis
• Query Analysis
• Summary
Thursday, April 19, 2018 2
SPARK SCALA & SPARK SQL
Do you want to improve your Apache Spark performance?
QUESTIONS ADDRESSED IN THIS SESSION
1. Should I use Spark Scala or Spark SQL?
2. Does Hive Metastore have an impact on the performance?
3. Should I consider a certain File Format?
• Master thesis: ā€œEvaluation of TPC-H on Spark & Spark SQL in ALOJAā€
OUTCOME OF THE PERFORMANCE EVALUATION
1. Up to 30% performance increase by switching between Spark Scala & Spark SQL
2. The Hive Metastore produces an overhead
3. File format and compression increase performance
• Parquet with Snappy compression is the best choice
• Performance evaluation conducted on Spark 2.1.1
MOTIVATION & RESEARCH OBJECTIVES
• Absence of a comprehensive performance evaluation of Spark SQL compared to Spark Scala
• Investigating the performance impact of Spark SQL and Spark Scala
• Investigating the influence of Hive’s Metastore on performance
• Attempting to detect possible bottlenecks in terms of runtime
• Impact of various alternative file formats with different compressions applied
• Implement a Spark Scala TPC-H benchmark within ALOJA
  • Benchmark is publicly accessible on GitHub
ALOJA
• Benchmark platform to characterize the cost-effectiveness of Big Data deployments
  • https://aloja.bsc.es/
  • https://github.com/Aloja/aloja
• Collaboration with the Barcelona Supercomputing Center (BSC)
  • Nicolas Poggi
  • Alejandro Montero
TPC-H BENCHMARK
• Popular decision support benchmark
• Composed of eight differently sized tables
• 22 complex, business-oriented ad-hoc queries
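The two APIs under comparison can be illustrated with a simplified aggregation over the TPC-H lineitem table; a minimal sketch assuming Spark 2.1 and a placeholder HDFS path (requires a running Spark deployment):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("tpch-sketch").getOrCreate()
import spark.implicits._

// Placeholder HDFS path for the TPC-H lineitem table
val lineitem = spark.read.parquet("hdfs:///tpch/lineitem")
lineitem.createOrReplaceTempView("lineitem")

// "Spark SQL": the query as a declarative SQL string
val viaSql = spark.sql(
  """SELECT l_returnflag, sum(l_quantity) AS sum_qty
    |FROM lineitem
    |GROUP BY l_returnflag""".stripMargin)

// "Spark Scala": the same query through the DataFrame/Dataset API
val viaApi = lineitem
  .groupBy($"l_returnflag")
  .agg(sum($"l_quantity").as("sum_qty"))
```

Both variants produce the same result; as the following slides show, their physical plans and runtimes can still differ.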
SPARK ECOSYSTEM / INTERFACES
https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf
DATA ACCESS
• Data access from Spark on HDFS
• With or without Metastore
• Data file formats: Text, ORC & Parquet
• Dataset API
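Whether the Hive Metastore is involved is decided when the SparkSession is created; a sketch (app names are placeholders, and the Hive variant assumes Hive support and a hive-site.xml on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Without the Hive Metastore: Spark uses its built-in catalog
val plain = SparkSession.builder()
  .appName("tpch-no-metastore")
  .getOrCreate()

// With the Hive Metastore: table metadata is resolved through Hive
val withHive = SparkSession.builder()
  .appName("tpch-metastore")
  .enableHiveSupport()
  .getOrCreate()
```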
FILE FORMATS
• Text
• ORC & Parquet with standard compression
  • GZIP and ZLIB
• ORC with Snappy compression
• Parquet with Snappy compression
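The evaluated format/compression combinations can be produced with the DataFrameWriter; a sketch with placeholder paths, assuming `lineitem` is a previously loaded TPC-H table (in Spark 2.1, GZIP is the Parquet default and ZLIB the ORC default):

```scala
lineitem.write.csv("hdfs:///tpch/lineitem_text")                                      // Text
lineitem.write.option("compression", "gzip").parquet("hdfs:///tpch/lineitem_pq_gz")   // Parquet + GZIP
lineitem.write.option("compression", "snappy").parquet("hdfs:///tpch/lineitem_pq_sn") // Parquet + Snappy
lineitem.write.option("compression", "zlib").orc("hdfs:///tpch/lineitem_orc_zl")      // ORC + ZLIB
lineitem.write.option("compression", "snappy").orc("hdfs:///tpch/lineitem_orc_sn")    // ORC + Snappy
```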
FILE FORMATS
Spark Scala file formats with Snappy compression on a cluster with 1 TB
FILE FORMATS
• Parquet is up to 50% faster than Text
• With the standard compressions (GZIP and ZLIB), Parquet is up to 16% faster than ORC
• Snappy compression is faster than the standard compressions
• On average, Parquet with Snappy is 10% faster than ORC with Snappy
  • Snappy is the only compression common to both formats
TAKEAWAY
• File formats and compression benefit the performance of all queries and both benchmarks equally
• ORC & Parquet perform best overall with Snappy
• Parquet with Snappy compression is the best choice
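If Parquet with Snappy is adopted, the codec can also be set session-wide instead of per write; a sketch assuming an existing `spark` session and a placeholder path:

```scala
// Make Snappy the default Parquet codec for this session
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
lineitem.write.parquet("hdfs:///tpch/lineitem_parquet") // now Snappy-compressed
```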
DATA ACCESS
TPC-H BENCHMARK RESULTS
Query | Spark Scala (sec) | Spark SQL (sec) | Difference (%)
Q2    | 78                | 83              | 7%
Q4    | 73                | 100             | 26%
Q5    | 126               | 99              | 27%
Q7    | 111               | 94              | 18%
Q8    | 99                | 83              | 20%
Q11   | 83                | 68              | 21%
Q14   | 54                | 64              | 15%
Q15   | 69                | 80              | 14%
Q18   | 103               | 123             | 16%
Q19   | 60                | 80              | 25%
Q21   | 262               | 221             | 18%
TAKEAWAY
• Spark Scala does not outperform Spark SQL
• Spark Scala and Spark SQL process queries differently
  • Are the applied optimization rules the same?
• The Hive Metastore does not improve performance, but creates a minor overhead
• Possibility to improve performance by simply switching APIs
WHAT TO DO?
1. Is there a pattern?
  • When to use Spark Scala?
  • When to use Spark SQL?
2. What are the root causes?
QUERY ANALYSIS
• Two approaches were used to investigate the identified performance differences:
  1. Correlation analysis based on the choke point analysis
  2. Investigation of the execution plans
CHOKE POINT ANALYSIS
• Classifying each TPC-H benchmark query into 6 categories (Low/Medium/High):
  • Aggregation Performance
  • Join Performance
  • Data Access Locality
  • Expression Calculation
  • Correlated Subqueries
  • Parallel Execution
• The correlation analysis is based on this classification
* P. Boncz, T. Neumann, and O. Erling, ā€œTPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,ā€ in Performance Characterization and Benchmarking, 2013, pp. 61–76
CORRELATION ANALYSIS
SPARK SCALA – HIGH EXPRESSION CALCULATION
SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
TAKEAWAY
• Spark Scala performs better in cases of heavy Expression Calculation
• Spark SQL is the better choice in cases of strong Data Access Locality combined with heavyweight Parallel Execution
EXECUTION PLAN ANALYSIS
• The execution plan analysis revealed differently applied optimizations
• Spark SQL and Spark Scala have different physical plans
• Queries Q4, Q5, Q11 and Q19 exemplify the most substantial execution plan variations:
  • Different joins
  • Different join order
  • Different join build side
  • Missing filters
  • Missing projections
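Such variations can be observed directly with `explain`; a sketch assuming an existing `spark` session and `orders` as a loaded TPC-H table registered as a temp view:

```scala
// Same aggregation expressed through both APIs
val viaSql = spark.sql(
  "SELECT o_orderpriority, count(*) AS cnt FROM orders GROUP BY o_orderpriority")
val viaApi = orders.groupBy($"o_orderpriority").count()

// Prints the parsed, analyzed, optimized and physical plans for comparison
viaSql.explain(true)
viaApi.explain(true)
```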
Note: the missing filters and projections are not explicitly defined in the query, but are applied by one API and not the other.
QUERY ANALYSIS – Q11
• TPC-H query Q11 shows bad performance for Spark Scala
• The performance differences can be traced back to the different joins applied
• Wrong build side for joins
QUERY 11
Spark Scala: 1 x BroadcastHash, 2 x SortMerge, 1 x BroadcastNestedLoop → bad performance
Spark SQL: 4 x BroadcastHash → good performance

Join Type           | Complexity
BroadcastHash       | O(N)
SortMerge           | O(N log N), if not sorted
BroadcastNestedLoop | O(N²)
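When Spark picks an unfavorable join, the planner can be steered with the `broadcast` hint; a sketch with assumed, previously loaded TPC-H tables `supplier` and `nation`:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the small `nation` table,
// nudging the planner towards a BroadcastHashJoin
val joined = supplier.join(broadcast(nation), $"s_nationkey" === $"n_nationkey")
joined.explain() // physical plan should contain BroadcastHashJoin
```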
SUMMARY
• Up to 30% performance increase by simply switching APIs
• Parquet with Snappy is the best choice
• Spark APIs can be intermixed seamlessly, but
  • there are differences in the execution plans
  • there is no guarantee of best performance
• Different optimization rules are applied
  • Spark SQL uses the Catalyst Optimizer
THANK YOU
RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
M.SC. Raphael Radowitz
Contact Details
Phone: +82 (0) 10 9174 3788
Email: rradowitz@outlook.de