AN EVALUATION OF TPC-H
ON SPARK & SPARK SQL IN ALOJA
M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
AGENDA
• Motivation & Research Objectives
• Spark
  • Ecosystem
  • Data Access
• ALOJA & TPC-H
• Spark SQL with or without Hive Metastore
• File Formats
• Correlation Analysis
• Query Analysis
• Summary
Thursday, April 19, 2018 2
SPARK SCALA & SPARK SQL
Do you want to improve your Apache Spark performance?
QUESTIONS ADDRESSED IN THIS SESSION
1. Should I use Spark Scala or Spark SQL?
2. Does Hive Metastore have an impact on the performance?
3. Should I consider a certain File Format?
• Master thesis: ā€œEvaluation of TPC-H on Spark & Spark SQL in ALOJAā€
OUTCOME OF THE PERFORMANCE EVALUATION
1. Up to 30% performance increase by switching between Spark Scala & Spark SQL
2. The Hive Metastore produces an overhead
3. File format and compression increase performance
• Parquet with Snappy compression is the best choice
• Performance evaluation conducted on Spark 2.1.1
MOTIVATION & RESEARCH OBJECTIVES
• Absence of a comprehensive performance evaluation of Spark SQL compared to Spark Scala
• Investigating the performance impact of Spark SQL and Spark Scala
• Investigating the influence of Hive’s Metastore on performance
• Attempting to detect possible bottlenecks in terms of runtime
• Impact of various alternative file formats with different compressions applied
• Implement a Spark Scala TPC-H benchmark within ALOJA
  • Benchmark is publicly accessible on GitHub
ALOJA
• Benchmark platform to characterize the cost-effectiveness of Big Data deployments
  • https://aloja.bsc.es/
  • https://github.com/Aloja/aloja
• Collaboration with the Barcelona Supercomputing Center (BSC)
  • Nicolas Poggi
  • Alejandro Montero
TPC-H BENCHMARK
• Popular decision support benchmark
• Composed of eight differently sized tables
• 22 complex, business-oriented ad-hoc queries
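The two APIs under comparison can be illustrated with a simplified aggregation over the TPC-H lineitem table; a minimal sketch assuming Spark 2.1 and a placeholder HDFS path (requires a running Spark deployment):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("tpch-sketch").getOrCreate()
import spark.implicits._

// Placeholder HDFS path for the TPC-H lineitem table
val lineitem = spark.read.parquet("hdfs:///tpch/lineitem")
lineitem.createOrReplaceTempView("lineitem")

// "Spark SQL": the query as a declarative SQL string
val viaSql = spark.sql(
  """SELECT l_returnflag, sum(l_quantity) AS sum_qty
    |FROM lineitem
    |GROUP BY l_returnflag""".stripMargin)

// "Spark Scala": the same query through the DataFrame/Dataset API
val viaApi = lineitem
  .groupBy($"l_returnflag")
  .agg(sum($"l_quantity").as("sum_qty"))
```

Both variants produce the same result; as the following slides show, their physical plans and runtimes can still differ.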
SPARK ECOSYSTEM / INTERFACES
https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf
DATA ACCESS
• Data access from Spark on HDFS
• With or without Metastore
• Data file formats: Text, ORC & Parquet
• Dataset API
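Whether the Hive Metastore is involved is decided when the SparkSession is created; a sketch (app names are placeholders, and the Hive variant assumes Hive support and a hive-site.xml on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Without the Hive Metastore: Spark uses its built-in catalog
val plain = SparkSession.builder()
  .appName("tpch-no-metastore")
  .getOrCreate()

// With the Hive Metastore: table metadata is resolved through Hive
val withHive = SparkSession.builder()
  .appName("tpch-metastore")
  .enableHiveSupport()
  .getOrCreate()
```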
FILE FORMATS
• Text
• ORC & Parquet with standard compression
  • GZIP and ZLIB
• ORC with Snappy compression
• Parquet with Snappy compression
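The evaluated format/compression combinations can be produced with the DataFrameWriter; a sketch with placeholder paths, assuming `lineitem` is a previously loaded TPC-H table (in Spark 2.1, GZIP is the Parquet default and ZLIB the ORC default):

```scala
lineitem.write.csv("hdfs:///tpch/lineitem_text")                                      // Text
lineitem.write.option("compression", "gzip").parquet("hdfs:///tpch/lineitem_pq_gz")   // Parquet + GZIP
lineitem.write.option("compression", "snappy").parquet("hdfs:///tpch/lineitem_pq_sn") // Parquet + Snappy
lineitem.write.option("compression", "zlib").orc("hdfs:///tpch/lineitem_orc_zl")      // ORC + ZLIB
lineitem.write.option("compression", "snappy").orc("hdfs:///tpch/lineitem_orc_sn")    // ORC + Snappy
```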
FILE FORMATS
Spark Scala file formats with Snappy compression on a cluster with 1 TB
FILE FORMATS
• Parquet is up to 50% faster than Text
• With the standard compressions (GZIP and ZLIB), Parquet is up to 16% faster than ORC
• Snappy compression is faster than the standard compressions
• On average, Parquet with Snappy is 10% faster than ORC with Snappy
  • Snappy is the only compression common to both formats
TAKEAWAY
• File formats and compression benefit the performance of all queries and both benchmarks equally
• ORC & Parquet perform best overall with Snappy
• Parquet with Snappy compression is the best choice
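If Parquet with Snappy is adopted, the codec can also be set session-wide instead of per write; a sketch assuming an existing `spark` session and a placeholder path:

```scala
// Make Snappy the default Parquet codec for this session
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
lineitem.write.parquet("hdfs:///tpch/lineitem_parquet") // now Snappy-compressed
```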
DATA ACCESS
TPC-H BENCHMARK RESULTS
Query | Spark Scala (sec) | Spark SQL (sec) | Difference (%)
Q2    | 78                | 83              | 7%
Q4    | 73                | 100             | 26%
Q5    | 126               | 99              | 27%
Q7    | 111               | 94              | 18%
Q8    | 99                | 83              | 20%
Q11   | 83                | 68              | 21%
Q14   | 54                | 64              | 15%
Q15   | 69                | 80              | 14%
Q18   | 103               | 123             | 16%
Q19   | 60                | 80              | 25%
Q21   | 262               | 221             | 18%
TAKEAWAY
• Spark Scala does not outperform Spark SQL
• Spark Scala and Spark SQL process queries differently
  • Are the applied optimization rules the same?
• The Hive Metastore does not improve performance, but creates a minor overhead
• Possibility to improve performance by simply switching APIs
WHAT TO DO?
1. Is there a pattern?
  • When to use Spark Scala?
  • When to use Spark SQL?
2. What are the root causes?
QUERY ANALYSIS
• Two approaches were used to investigate the identified performance differences:
  1. Correlation analysis based on the choke point analysis
  2. Investigation of the execution plans
CHOKE POINT ANALYSIS
• Classifying each TPC-H benchmark query into 6 categories (Low/Medium/High):
  • Aggregation Performance
  • Join Performance
  • Data Access Locality
  • Expression Calculation
  • Correlated Subqueries
  • Parallel Execution
• The correlation analysis is based on this classification
* P. Boncz, T. Neumann, and O. Erling, ā€œTPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,ā€ in Performance Characterization and Benchmarking, 2013, pp. 61–76
CORRELATION ANALYSIS
SPARK SCALA – HIGH EXPRESSION CALCULATION
SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
TAKEAWAY
• Spark Scala performs better in cases of heavy Expression Calculation
• Spark SQL is the better choice in cases of strong Data Access Locality combined with heavyweight Parallel Execution
EXECUTION PLAN ANALYSIS
• The execution plan analysis revealed differently applied optimizations
• Spark SQL and Spark Scala have different physical plans
• Queries Q4, Q5, Q11 and Q19 exemplify the most substantial execution plan variations:
  • Different joins
  • Different join order
  • Different join build side
  • Missing filters
  • Missing projections
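Such variations can be observed directly with `explain`; a sketch assuming an existing `spark` session and `orders` as a loaded TPC-H table registered as a temp view:

```scala
// Same aggregation expressed through both APIs
val viaSql = spark.sql(
  "SELECT o_orderpriority, count(*) AS cnt FROM orders GROUP BY o_orderpriority")
val viaApi = orders.groupBy($"o_orderpriority").count()

// Prints the parsed, analyzed, optimized and physical plans for comparison
viaSql.explain(true)
viaApi.explain(true)
```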
Note: the missing filters and projections are not explicitly defined in the query, but are applied by one API and not the other.
QUERY ANALYSIS – Q11
• TPC-H query Q11 shows bad performance for Spark Scala
• The performance differences can be traced back to the different joins applied
• Wrong build side for joins
QUERY 11
Spark Scala: 1 x BroadcastHash, 2 x SortMerge, 1 x BroadcastNestedLoop → bad performance
Spark SQL: 4 x BroadcastHash → good performance

Join Type           | Complexity
BroadcastHash       | O(N)
SortMerge           | O(N log N), if not sorted
BroadcastNestedLoop | O(N²)
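When Spark picks an unfavorable join, the planner can be steered with the `broadcast` hint; a sketch with assumed, previously loaded TPC-H tables `supplier` and `nation`:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the small `nation` table,
// nudging the planner towards a BroadcastHashJoin
val joined = supplier.join(broadcast(nation), $"s_nationkey" === $"n_nationkey")
joined.explain() // physical plan should contain BroadcastHashJoin
```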
SUMMARY
• Up to 30% performance increase by simply switching APIs
• Parquet with Snappy is the best choice
• Spark APIs can be intermixed seamlessly, but
  • there are differences in the execution plans
  • there is no guarantee of best performance
• Different optimization rules are applied
  • Spark SQL uses the Catalyst Optimizer
THANK YOU
RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
M.SC. Raphael Radowitz
Contact Details
Phone: +82 (0) 10 9174 3788
Email: rradowitz@outlook.de