AN EVALUATION OF TPC-H
ON SPARK & SPARK SQL IN ALOJA
M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
AGENDA
 Motivation & Research Objectives
 Spark
 Ecosystem
 Data Access
 ALOJA & TPC-H
 Spark SQL with or without Hive Metastore
 File Formats
 Correlation Analysis
 Query Analysis
 Summary
Thursday, April 19, 2018 2
SPARK SCALA & SPARK SQL
Do you want to improve your Apache Spark
performance?
QUESTIONS ADDRESSED IN THIS SESSION
1. Should I use Spark Scala or Spark SQL?
2. Does Hive Metastore have an impact on the performance?
3. Should I prefer a particular file format?
 Master thesis: “Evaluation of TPC-H on Spark & Spark SQL in ALOJA”
OUTCOME OF THE PERFORMANCE EVALUATION
1. Up to 30% performance increase by switching between Spark Scala &
Spark SQL
2. Hive Metastore produces an overhead
3. File format and compression improve performance
 Parquet with Snappy compression is the best choice
 Performance Evaluation conducted on Spark 2.1.1
MOTIVATION & RESEARCH OBJECTIVES
 Absence of a comprehensive performance evaluation of
Spark SQL compared to Spark Scala
 Investigating the performance impact of Spark SQL and Spark Scala
 Investigating the influence of Hive’s Metastore on performance
 Detecting possible runtime bottlenecks
 Impact of alternative file formats with different compression codecs
 Implement a Spark Scala TPC-H benchmark within ALOJA
 Benchmark is publicly accessible on GitHub
ALOJA
 Benchmark platform to characterize cost-effectiveness of Big Data
deployments
 https://aloja.bsc.es/
 https://github.com/Aloja/aloja
 Collaboration with the Barcelona Supercomputing Center (BSC)
 Nicolas Poggi
 Alejandro Montero
TPC-H BENCHMARK
 Popular decision support benchmark
 Composed of eight tables of different sizes
 22 complex, business-oriented ad-hoc queries
SPARK ECOSYSTEM / INTERFACES
https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf
DATA ACCESS
 Data access from Spark on HDFS
 With or without Metastore
 Data File Formats: Text, ORC & Parquet
 Dataset API
FILE FORMATS
 Text
 ORC & Parquet with standard compression
 GZIP and ZLIB
 ORC with Snappy compression
 Parquet with Snappy compression
FILE FORMATS
Figure: Spark Scala file formats with Snappy compression on cluster with 1 TB
FILE FORMATS
 Parquet is up to 50% faster than text
 Standard compressions – GZIP and ZLIB
 Parquet is up to 16% faster than ORC
 Snappy compression (faster than standard
compression)
 On average Parquet with Snappy is 10% faster than ORC
with Snappy compression
 Snappy is the only compression common to both formats
TAKEAWAY
 File formats and compression benefit the
performance of all queries and both benchmarks
equally
 ORC & Parquet perform overall best with Snappy
 Parquet with Snappy compression is the best
choice
DATA
ACCESS
TPC-H
BENCHMARK
RESULTS
TPC-H BENCHMARK RESULTS
Query   Spark Scala (sec)   Spark SQL (sec)   Difference (%)
Q2      78                  83                7
Q4      73                  100               26
Q5      126                 99                27
Q7      111                 94                18
Q8      99                  83                20
Q11     83                  68                21
Q14     54                  64                15
Q15     69                  80                14
Q18     103                 123               16
Q19     60                  80                25
Q21     262                 221               18
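The slides do not state how the Difference (%) column was derived. As a rough sketch, the Python below assumes it is the absolute runtime gap divided by the Spark SQL runtime. That assumption comes close to the table but does not reproduce it exactly (some rows differ by about one percentage point, presumably because the published runtimes are rounded). The names `RUNTIMES`, `relative_difference` and `faster_api` are illustrative and not part of the benchmark.

```python
# Runtimes in seconds, copied from the results table: query -> (Spark Scala, Spark SQL).
RUNTIMES = {
    "Q2": (78, 83), "Q4": (73, 100), "Q5": (126, 99), "Q7": (111, 94),
    "Q8": (99, 83), "Q11": (83, 68), "Q14": (54, 64), "Q15": (69, 80),
    "Q18": (103, 123), "Q19": (60, 80), "Q21": (262, 221),
}


def relative_difference(scala_sec, sql_sec):
    """Absolute runtime gap as a percentage of the Spark SQL runtime.

    ASSUMPTION: the deck does not give the formula; this is one plausible reading.
    """
    return abs(scala_sec - sql_sec) / sql_sec * 100


def faster_api(scala_sec, sql_sec):
    """Name the API with the lower runtime for a single query."""
    if scala_sec < sql_sec:
        return "Spark Scala"
    if sql_sec < scala_sec:
        return "Spark SQL"
    return "tie"


for query, (scala_sec, sql_sec) in RUNTIMES.items():
    winner = faster_api(scala_sec, sql_sec)
    diff = relative_difference(scala_sec, sql_sec)
    print(f"{query:>4}  {winner:<11}  {diff:5.1f}%")
```

Listing the faster API per query makes it easy to see that neither API wins across the board, which is the point the following takeaway slide makes.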
TAKEAWAY
 Spark Scala does not outperform Spark SQL
 Spark Scala and Spark SQL process queries
differently
 Are the applied optimization rules the same?
 Hive Metastore does not improve the performance,
but creates a minor overhead
 Possibility to improve performance by simply
switching API
WHAT TO DO?
1. Is there a pattern?
 When to use Spark Scala?
 When to use Spark SQL?
2. What are the root causes?
QUERY ANALYSIS
 2 approaches to investigate the performance differences identified:
1. Correlation analysis based on the Choke Point Analysis
2. Investigation of the Execution Plan
CHOKE POINT
ANALYSIS
 Rating each TPC-H benchmark query as Low/Medium/High
in 6 categories:
 Aggregation Performance
 Join Performance
 Data Access Locality
 Expression Calculation
 Correlated Subqueries
 Parallel Execution
 The correlation analysis is based on this
classification
* P. Boncz, T. Neumann, and O. Erling, “TPC-H Analyzed: Hidden Messages and
Lessons Learned from an Influential Benchmark,” in Performance Characterization
and Benchmarking, 2013, pp. 61–76
CORRELATION ANALYSIS
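The slides do not say which correlation measure was used or list the per-query ratings. The sketch below assumes a Pearson correlation between a choke-point rating (encoded Low=1, Medium=2, High=3) and the runtime difference between the two APIs. All sample values are made up for illustration and are not results from the thesis.

```python
from math import sqrt

# ASSUMPTION: ordinal encoding of the Low/Medium/High ratings.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL data: one rating per query for a single choke-point category,
# and the runtime advantage of one API over the other in percent.
ratings = ["Low", "Low", "Medium", "Medium", "High", "High"]
advantage_pct = [-5.0, 2.0, 4.0, 9.0, 12.0, 20.0]

r = pearson([LEVEL[level] for level in ratings], advantage_pct)
print(f"r = {r:.2f}")
```

A coefficient near +1 or -1 for a category would suggest that queries stressing that choke point tend to favour one API; a value near 0 would suggest no linear relationship.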
SPARK SCALA – HIGH EXPRESSION CALCULATION
SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
TAKEAWAY
 Spark Scala performs better in case of heavy
Expression Calculation
 Spark SQL is the better choice in case of
strong Data Access Locality in combination
with heavyweight Parallel Execution
EXECUTION
PLAN ANALYSIS
 Execution Plan Analysis revealed different applied
optimizations
 Spark SQL and Spark Scala do have different physical
plans
 Queries Q4, Q5, Q11 and Q19 exemplify the most substantial
Execution Plan variations:
 Different Joins
 Different Join order
 Different Join build side
 Missing filters
 Missing projection
Not explicitly defined, but applied for one API and not the other.
QUERY ANALYSIS – Q11
 TPC-H query Q11 performs poorly on Spark Scala
 Performance differences can be traced back to the different joins applied
 Wrong build side for joins
QUERY 11
               Spark Scala                Spark SQL
Joins          1 x BroadcastHash          4 x BroadcastHash
               2 x SortMerge
               1 x BroadcastNestedLoop
Performance    Bad                        Good

Join Type              Complexity
BroadcastHash          O(N)
SortMerge              O(N log N), if not sorted
BroadcastNestedLoop    O(N²)
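To make the complexity gap concrete, here is a minimal single-machine sketch in plain Python that contrasts a nested-loop join with a hash join by counting the work each does. It is not Spark's implementation; the function names and the `suppliers` and `parts` sample data are hypothetical.

```python
def nested_loop_join(left, right):
    """Compare every left row with every right row: len(left) * len(right) comparisons."""
    comparisons = 0
    rows = []
    for left_key, left_val in left:
        for right_key, right_val in right:
            comparisons += 1
            if left_key == right_key:
                rows.append((left_key, left_val, right_val))
    return rows, comparisons


def hash_join(build, probe):
    """Build a hash table on one side, then do one lookup per row of the other side."""
    table = {}
    for key, val in build:
        table.setdefault(key, []).append(val)
    lookups = 0
    rows = []
    for key, probe_val in probe:
        lookups += 1
        for build_val in table.get(key, []):
            rows.append((key, build_val, probe_val))
    return rows, lookups


# HYPOTHETICAL sample data: 100 suppliers, 1000 parts, each part has one supplier key.
suppliers = [(i, f"supplier-{i}") for i in range(100)]
parts = [(i % 100, f"part-{i}") for i in range(1000)]

nl_rows, nl_cost = nested_loop_join(suppliers, parts)
hj_rows, hj_cost = hash_join(suppliers, parts)  # build on the smaller side

print(f"nested loop: {len(nl_rows)} rows, {nl_cost} comparisons")
print(f"hash join:   {len(hj_rows)} rows, {hj_cost} lookups")
```

Both functions return the same rows, but the nested loop's work grows with the product of the input sizes while the hash join's grows with their sum. Building the hash table on the smaller input keeps it small, which is why the build side matters.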
SUMMARY
 Up to 30% performance increase by simply
switching API
 Parquet with Snappy is best
 Spark APIs can be intermixed seamlessly, but
 differences in the execution plan
 no guarantee for best performance
 Different optimization rules are applied
 Spark SQL uses the Catalyst Optimizer
THANK YOU
RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
M.SC. Raphael Radowitz
Contact Detail
Phone: +82 (0) 10 9174 3788
Email: rradowitz@outlook.de

More Related Content

What's hot

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.JananiJ19
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 

What's hot (20)

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark
SparkSpark
Spark
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 

Similar to Spark SQL Beats Spark Scala by 30% for Some Queries

Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...DataBench
 
Exploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured StreamingExploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured Streamingt_ivanov
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache HivemallMakoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Edureka!
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSChristoforos Kachris
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmarkt_ivanov
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data ValidationDatabricks
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Knoldus Inc.
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Edureka!
 
Migrating PostgreSQL to the Cloud
Migrating PostgreSQL to the CloudMigrating PostgreSQL to the Cloud
Migrating PostgreSQL to the CloudMike Fowler
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 

Similar to Spark SQL Beats Spark Scala by 30% for Some Queries (20)

Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
 
Exploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured StreamingExploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured Streaming
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?
 
Industrialiser spark
Industrialiser sparkIndustrialiser spark
Industrialiser spark
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWS
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmark
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data Validation
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
SAIS/DWS2018報告会 #saisdws2018
SAIS/DWS2018報告会 #saisdws2018SAIS/DWS2018報告会 #saisdws2018
SAIS/DWS2018報告会 #saisdws2018
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why
 
Migrating PostgreSQL to the Cloud
Migrating PostgreSQL to the CloudMigrating PostgreSQL to the Cloud
Migrating PostgreSQL to the Cloud
 
What's new in Spark 2.0?
What's new in Spark 2.0?What's new in Spark 2.0?
What's new in Spark 2.0?
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
2016 spark survey
2016 spark survey2016 spark survey
2016 spark survey
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
AI at Scale
AI at ScaleAI at Scale
AI at Scale
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Spark SQL Beats Spark Scala by 30% for Some Queries

  • 1. AN EVALUATION OF TPC-H ON SPARK & SPARK SQL IN ALOJA M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018 FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
  • 2. AGENDA
      • Motivation & Research Objectives
      • Spark
      • Ecosystem
      • Data Access
      • ALOJA & TPC-H
      • Spark SQL with or without Hive Metastore
      • File Formats
      • Correlation Analysis
      • Query Analysis
      • Summary
      Thursday, April 19, 2018
  • 3. SPARK SCALA & SPARK SQL
      Do you want to improve your Apache Spark performance?
  • 4. QUESTIONS ADDRESSED IN THIS SESSION
      1. Should I use Spark Scala or Spark SQL?
      2. Does Hive Metastore have an impact on performance?
      3. Should I consider a certain file format?
      • Master thesis: “Evaluation of TPC-H on Spark & Spark SQL in ALOJA”
  • 5. OUTCOME OF THE PERFORMANCE EVALUATION
      1. Up to 30% performance increase by switching between Spark Scala & Spark SQL
      2. Hive Metastore produces an overhead
      3. File format and compression increase performance
          • Parquet with Snappy compression is the best choice
      • Performance evaluation conducted on Spark 2.1.1
  • 6. MOTIVATION & RESEARCH OBJECTIVES
      • Absence of a comprehensive performance evaluation of Spark SQL compared to Spark Scala
      • Investigating the performance impact of Spark SQL and Spark Scala
      • Investigating the influence of Hive’s Metastore on performance
      • Attempting to detect possible runtime bottlenecks
      • Impact of alternative file formats with different compressions applied
      • Implementing a Spark Scala TPC-H benchmark within ALOJA
      • Benchmark is publicly accessible on GitHub
  • 7. ALOJA
      • Benchmark platform to characterize cost-effectiveness of Big Data deployments
      • https://aloja.bsc.es/
      • https://github.com/Aloja/aloja
      • Collaboration with the Barcelona Supercomputing Center (BSC)
      • Nicolas Poggi
      • Alejandro Montero
  • 8. TPC-H BENCHMARK
      • Popular decision support benchmark
      • Composed of eight tables of different sizes
      • 22 complex business-oriented ad-hoc queries
  • 9. SPARK ECOSYSTEM / INTERFACES
      Source: https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf
  • 10. DATA ACCESS
      • Data access from Spark on HDFS
      • With or without Metastore
      • Data file formats: Text, ORC & Parquet
      • Dataset API
  • 11. FILE FORMATS
      • Text
      • ORC & Parquet with standard compression
          • GZIP and ZLIB
      • ORC with Snappy compression
      • Parquet with Snappy compression
  • 12. FILE FORMATS
      Figure: Spark Scala file formats with Snappy compression on a cluster with 1 TB
  • 13. FILE FORMATS
      • Parquet is up to 50% faster than text
      • Standard compressions – GZIP and ZLIB
          • Parquet is up to 16% faster than ORC
      • Snappy compression (faster than standard compression)
          • On average, Parquet with Snappy is 10% faster than ORC with Snappy
          • Only compression common to both formats
  • 14. TAKEAWAY
      • File formats and compression benefit the performance of all queries and both benchmarks equally
      • ORC & Parquet perform best overall with Snappy
      • Parquet with Snappy compression is the best choice
  • 17. TPC-H BENCHMARK RESULTS
      Query   Spark Scala (sec)   Spark SQL (sec)   Difference (%)
      Q2       78                  83                 7%
      Q4       73                 100                26%
      Q5      126                  99                27%
      Q7      111                  94                18%
      Q8       99                  83                20%
      Q11      83                  68                21%
      Q14      54                  64                15%
      Q15      69                  80                14%
      Q18     103                 123                16%
      Q19      60                  80                25%
      Q21     262                 221                18%
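The slide does not say how the "Difference (%)" column was derived. As a rough check, the sketch below recomputes it from the runtimes reported on slide 17, assuming the difference is the absolute runtime gap divided by the Spark SQL runtime. That formula is an assumption, not something the talk states; it lands within about one percentage point of every published value, which would be consistent with the seconds on the slide having been rounded. The function names are illustrative only.

```python
# Runtimes in seconds copied from the slide-17 results table:
# query -> (Spark Scala, Spark SQL, published difference in %)
RESULTS = {
    "Q2": (78, 83, 7),
    "Q4": (73, 100, 26),
    "Q5": (126, 99, 27),
    "Q7": (111, 94, 18),
    "Q8": (99, 83, 20),
    "Q11": (83, 68, 21),
    "Q14": (54, 64, 15),
    "Q15": (69, 80, 14),
    "Q18": (103, 123, 16),
    "Q19": (60, 80, 25),
    "Q21": (262, 221, 18),
}


def relative_difference(scala_sec, sql_sec):
    """Absolute runtime gap as a percentage of the Spark SQL runtime.

    ASSUMPTION: the talk does not state its formula; this one is a guess
    that roughly matches the published column.
    """
    return abs(scala_sec - sql_sec) / sql_sec * 100


def faster_api(scala_sec, sql_sec):
    """Name of the API with the lower runtime (the table contains no ties)."""
    return "Spark Scala" if scala_sec < sql_sec else "Spark SQL"


def summarize(results):
    """Return (query, faster API, recomputed %, published %) per query."""
    rows = []
    for query, (scala_sec, sql_sec, published) in results.items():
        recomputed = round(relative_difference(scala_sec, sql_sec), 1)
        rows.append((query, faster_api(scala_sec, sql_sec), recomputed, published))
    return rows


if __name__ == "__main__":
    for query, winner, recomputed, published in summarize(RESULTS):
        print(f"{query:<4} faster: {winner:<12} recomputed: {recomputed:5.1f}%  published: {published}%")
```

Grouping the rows by the faster API shows the split behind the talk's recommendation: neither API wins across the board, so the gain comes from picking the API per query.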
  • 18. TAKEAWAY
      • Spark Scala does not outperform Spark SQL
      • Spark Scala and Spark SQL process queries differently
          • Are the applied optimization rules the same?
      • Hive Metastore does not improve the performance, but creates a minor overhead
      • Possibility to improve performance by simply switching API
  • 19. WHAT TO DO?
      1. Is there a pattern?
          • When to use Spark Scala?
          • When to use Spark SQL?
      2. What are the root causes?
  • 20. QUERY ANALYSIS
      • 2 approaches to investigate the identified performance differences:
          1. Correlation analysis based on the Choke Point Analysis
          2. Investigation of the Execution Plan
  • 21. CHOKE POINT ANALYSIS
      • Classifying each TPC-H benchmark query in 6 categories, each rated Low/Medium/High:
          • Aggregation Performance
          • Join Performance
          • Data Access Locality
          • Expression Calculation
          • Correlated Subqueries
          • Parallel Execution
      • The correlation analysis is based on this classification
      * P. Boncz, T. Neumann, and O. Erling, “TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,” in Performance Characterization and Benchmarking, 2013, pp. 61–76
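The slides do not show how the correlation was computed. One plausible reading, sketched here, is to encode the Low/Medium/High rating of a choke-point category as 0/1/2 and take the Pearson correlation against the signed runtime advantage of one API. Both the encoding and the choice of Pearson are assumptions, and the example input is a made-up placeholder, not data from the talk or from the Boncz et al. classification.

```python
from math import sqrt

# ASSUMPTION: ordinal encoding of the choke-point ratings.
LEVELS = {"Low": 0, "Medium": 1, "High": 2}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlate_category(ratings, advantages):
    """Correlate one category's ratings with per-query runtime advantages.

    ratings:    one "Low"/"Medium"/"High" label per query
    advantages: signed advantage in percent per query
                (here: positive when Spark Scala is faster)
    """
    return pearson([LEVELS[r] for r in ratings], advantages)


if __name__ == "__main__":
    # Hypothetical placeholder input, NOT taken from the talk.
    example_ratings = ["Low", "Medium", "High", "High"]
    example_advantages = [-10.0, 0.0, 10.0, 10.0]
    print(correlate_category(example_ratings, example_advantages))
```

A coefficient near +1 for a category would suggest that queries rated High in it tend to favor the API counted as positive, and a coefficient near -1 the opposite. With only a handful of queries, such a value is a hint rather than evidence.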
  • 23. SPARK SCALA – HIGH EXPRESSION CALCULATION
  • 24. SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
  • 25. TAKEAWAY
      • Spark Scala performs better in case of heavy Expression Calculation
      • Spark SQL is the better choice in case of strong Data Access Locality in combination with heavyweight Parallel Execution
  • 26. EXECUTION PLAN ANALYSIS
      • Execution Plan Analysis revealed that different optimizations were applied
      • Spark SQL and Spark Scala have different physical plans
      • Queries Q4, Q5, Q11 and Q19 exemplify the most substantial Execution Plan variations:
          • Different joins
          • Different join order
          • Different join build side
          • Missing filters
          • Missing projection
      Note: Not explicitly defined, but applied for one API and not the other.
  • 27. QUERY ANALYSIS – Q11
      • TPC-H query Q11 demonstrates bad performance for Spark Scala
      • Performance differences can be tracked down to the different joins applied
      • Wrong build side for joins

      QUERY 11
      Spark Scala                  Spark SQL
      1 x BroadCastHash            4 x BroadCastHash
      2 x SortMerge
      1 x BroadCastNestedLoop
      Bad performance              Good performance

      Join Type              Complexity
      BroadCastHash          O(N)
      SortMerge              O(N log N), if not sorted
      BroadCastNestedLoop    O(N²)
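The complexity figures on slide 27 can be made concrete with small in-memory versions of the three join strategies. The sketch below is plain Python over lists of dicts, not Spark's implementation, and the table and column names are hypothetical stand-ins loosely modeled on TPC-H. It only illustrates why a hash join with a small build side does less work than a sort-merge join on unsorted input or a nested-loop join.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(probe_side, build_side, key):
    """Hash the build side once, then probe it: O(len(build) + len(probe)) expected.

    The build side should be the smaller input; choosing the larger one
    costs more memory and more time to build the table.
    """
    table = defaultdict(list)
    for row in build_side:
        table[row[key]].append(row)
    return [(p, b) for p in probe_side for b in table.get(p[key], [])]


def sort_merge_join(left, right, key):
    """Sort both inputs (O(N log N) if unsorted), then merge in one pass."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # ... and pair it with every left row sharing the key.
            while i < len(left) and left[i][key] == left_key:
                out.extend((left[i], right[k]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def as_pairs(joined):
    """Normalize a join result to sorted (suppkey, nation name) tuples."""
    return sorted((l["suppkey"], r["name"]) for l, r in joined)


# Hypothetical miniature tables, for illustration only.
NATIONS = [{"nationkey": 1, "name": "GERMANY"}, {"nationkey": 2, "name": "FRANCE"}]
SUPPLIERS = [
    {"suppkey": 10, "nationkey": 1},
    {"suppkey": 11, "nationkey": 2},
    {"suppkey": 12, "nationkey": 1},
    {"suppkey": 13, "nationkey": 3},  # no matching nation, dropped by the inner join
]

if __name__ == "__main__":
    print(as_pairs(hash_join(SUPPLIERS, NATIONS, "nationkey")))
```

All three functions return the same rows; only the amount of work differs. That matches the slide's point that the join strategies chosen by the optimizer, rather than the query logic itself, explain the Q11 runtime gap.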
  • 28. SUMMARY
      • Up to 30% performance increase by simply switching API
      • Parquet with Snappy is best
      • Spark APIs can be intermixed seamlessly, but
          • differences in the execution plan
          • no guarantee for best performance
      • Different optimization rules are applied
      • Spark SQL uses the Catalyst Optimizer
  • 29. THANK YOU
      M.Sc. Raphael Radowitz @DataWorks Summit, Berlin, 19th April 2018
      Contact details
      Phone: +82 (0) 10 9174 3788
      Email: rradowitz@outlook.de