An Insider’s Guide to Maximizing Spark SQL Performance

An Insider’s Guide to
Maximizing Spark SQL
Performance
Xiao Li @ gatorsmile
Japanese Hadoop/Spark Conf @ Tokyo | Mar 2019
1

About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark SQL, Database Replication, Information Integration
• Ph.D. in University of Florida
• Github: gatorsmile

Databricks Customers Across Industries
Financial Services Healthcare & Pharma Media & Entertainment Technology
Public Sector Retail & CPG Consumer Services Energy & Industrial IoTMarketing & AdTech
Data & Analytics Services

DATABRICKS WORKSPACE
Databricks Delta ML Frameworks
DATABRICKS CLOUD SERVICE
DATABRICKS RUNTIME
Reliable & Scalable Simple & Integrated
Databricks Unified Analytics Platform
APIs
Jobs
Models
Notebooks
Dashboards End to end ML lifecycle

Apache Spark 3.x
5
Catalyst Optimization & Tungsten Execution
SparkSession / DataFrame / DataSet APIs
SQL
Spark ML
Spark
Streaming
Spark
Graph
3rd-party
Libraries
Spark CoreData Source Connectors

Apache Spark 3.x
6
Catalyst Optimization & Tungsten Execution
SparkSession / DataFrame / DataSet APIs
SQL
Spark ML
Spark
Streaming
Spark
Graph
3rd-party
Libraries
Spark CoreData Source Connectors

From declarative queries to RDDs
7
SQL
Dataset
DataFrame
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan DAGsPhysical
Plans
Selected
Physical Plan
CostModel
Metadata
Catalog
Cache
Manager
Analyzer Optimizer Planner
Query
Execution
Parser
RDD
Results

9
Read Plan.
Interpret Plan.
Tune Plan.
Track Execution.

Read Plans from SQL
Tab in either Spark UI
or Spark History Server
10
Spark 3.0: Show the actual SQL statement? [SPARK-27045]

Page: In Details for SQL Query
11

12
Parsed Plan
Analyzed Plan
Optimized Plan
Physical Plan

15
Read the analyzed
plan to check the
implicit type
casting.
Tip:
Explicitly cast the
types in the queries.

16
Read the analyzed
plan to check the
implicit type
casting.
Tip:
Explicitly cast the
types in the queries.

Hive-serde Tables
17
Syntax to create a Hive Serde table

19
filter pushdown
Native
reader/writer
performs faster
than Hive serde
reader/writer

20
Native Data Source Tables
Syntax to create a Spark native ORC table
Tip:
Create native
data source
tables for better
performance
and stability.

21
Push Down + Implicit Type Casting
Not pushed down???
Tip:
Cast is needed?
Update the
constants?

Nested Schema Pruning
22Not pruned???

Collapse Projects
24
Call UDF three times!!!

Cross-session SQL Cache
26
• If a query is cached in the one session, the new
queries in all the sessions might be impacted.
• Check your query plan!

From
SQL query
to
Spark Jobs
29

30
• A SQL query => multiple Spark jobs.
• For example, broadcast exchange, shuffle
exchange, non-correlated subquery, lazily cached
queries.
• A Spark job => A DAG
• A chain of RDD dependencies organized in a
directed acyclic graph (DAG)

31
The higher
level SQL
physical
operators.
Optimized
ogical Plan DAGsPhysical
Plans
Selected
Physical Plan
CostModel
he
ger
r Planner
Query
ExecutionQuery Execution
The low
level Spark
RDD
primitives.

Job Tab in Spark UI
32
The amount of time for each job.
Any stage/task failure?

Job Tab
33
The amount of time for each stage.
• Jobs
• Stages
• Tasks

Stages Tab
34
• How the time are spent?
• Any outlier in task execution?
• Straggler tasks?
• Skew in data size, compute time?
• Too many/few tasks (partitions)?
• Load balanced? Locality?
Tasks specific info

35
Balanced? Skew?
Killed?
Which executor’s
log we should
read?

Executors Tab
36
size of data transferred
between stages
used/available memory
All the problematic executors in the same node?

37
- Interacting with Hive metastore?
- Slow query planning?

Storage Tab
38
Tip:
Caching query
results might be
slower if your
memory is not
large enough.

Storage Tab
39
Tip:
Unpersist
the cache
if not used.

Additional Resources
40
• Apache Spark document:
https://spark.apache.org/docs/latest/sql-programming-
guide.html
• Blog: https://databricks.com/blog/category/engineering/spark
• Previous summit: https://databricks.com/sparkaisummit/north-
america/sessions
• Databricks document: https://docs.databricks.com/
• Books: https://www.amazon.com/s?k=apache+spark
• Databricks academy: https://academy.databricks.com
• Databricks ebooks: https://databricks.com/resources/type/ebooks

Apache Spark™
• Use Cases
• Research
• Technical Deep Dives
AI
• Productionizing ML
• Deep Learning
• Cloud Hardware
Fields
• Data Science
• Data Engineering
• Enterprise
5000+ ATTENDEES
Practitioners:
Data Scientists, Data Engineers,
Analysts, Architects
Leaders:
Engineering Management, VPs,
Heads of Analytics & Data, CxOs
TRACKS
databricks.com/sparkaisummit

42
Nike: Enabling Data Scientists to bring their Models to Market
Facebook: Vectorized Query Execution in Apache Spark at Facebook
Tencent: Large-scale Malicious Domain Detection with Spark AI
IBM: In-memory storage Evolution in Apache Spark
Capital One: Apache Spark and Sights at Speed: Streaming, Feature
management and Execution
Apple: Making Nested Columns as First Citizen in Apache Spark SQL
EBay: Managing Apache Spark workload and automatic optimizing.
Google: Validating Spark ML Jobs
HP: Apache Spark for Cyber Security in big company
Microsoft: Apache Spark Serving: Unifying Batch, Streaming and
RESTful Serving
ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
Facebook: an efficient Facebook-scale shuffle service
IBM: Make your PySpark Data Fly with Arrow!
Facebook : Distributed Scheduling Framework for Apache Spark
Zynga: Automating Predictive Modeling at Zynga with PySpark
World Bank: Using Crowdsourced Images to Create Image Recognition
Models and NLP to Augment Global Trade indicator
JD.com: Optimizing Performance and Computing Resource.
Microsoft: Azure Databricks with R: Deep Dive
Airbnb: Apache Spark at Airbnb
Netflix: Migrating to Apache Spark at Netflix
Microsoft: Infrastructure for Deep Learning in
Apache Spark
Intel: Game playing using AI on Apache Spark
Facebook: Scaling Apache Spark @ Facebook
Lyft: Scaling Apache Spark on K8S at Lyft
Uber: Using Spark Mllib Models in a Production
Training and Serving Platform
Apple: Bridging the gap between Datasets and
DataFrames
Salesforce: The Rule of 10,000 Spark Jobs
Target: Lessons in Linear Algebra at Scale with
Apache Spark
ICL: Cooperative Task Execution for Apache Spark
Workday: Lesson Learned Using Apache Spark

Thank you
Xiao Li (lixiao@databricks.com)
43

An Insider’s Guide to Maximizing Spark SQL Performance

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to An Insider’s Guide to Maximizing Spark SQL Performance

Similar to An Insider’s Guide to Maximizing Spark SQL Performance (20)

More from Takuya UESHIN

More from Takuya UESHIN (10)

Recently uploaded

Recently uploaded (20)

An Insider’s Guide to Maximizing Spark SQL Performance