Understanding Query Plans and Spark UIs

Xiao Li @ gatorsmile
Spark + AI Summit @ SF | April 2019

About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• Github: gatorsmile

Databricks Customers Across Industries
• Financial Services
• Healthcare & Pharma
• Media & Entertainment
• Technology
• Public Sector
• Retail & CPG
• Consumer Services
• Energy & Industrial IoT
• Marketing & AdTech
• Data & Analytics Services

Databricks Unified Analytics Platform
[Diagram labels]
• DATABRICKS WORKSPACE: Notebooks, Jobs, Models, APIs, Dashboards; end-to-end ML lifecycle
• DATABRICKS RUNTIME: Databricks Delta, ML Frameworks
• DATABRICKS CLOUD SERVICE
• Reliable & Scalable; Simple & Integrated

Apache Spark 3.x
[Diagram labels: the Spark stack]
• SQL, Spark ML, Spark Streaming, Spark Graph, 3rd-party Libraries
• SparkSession / DataFrame / Dataset APIs
• Catalyst Optimization & Tungsten Execution
• Spark Core, Data Source Connectors

From declarative queries to RDDs
[Diagram; surviving label: Cypher]

Read Plan.
Interpret Plan.
Tune Plan.
Track Execution.

Read plans from the SQL tab in either the Spark UI or the Spark History Server.

Spark 3.0: Show the actual SQL statement? [SPARK-27045]

Page: In Details for SQL Query

Parsed Plan -> Analyzed Plan -> Optimized Plan -> Physical Plan
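
The four plans can also be printed outside the UI. The sketch below is a minimal illustration, assuming a PySpark DataFrame whose explain(extended=True) prints its output through Python's standard output with "== ... ==" section headers; the helper names capture_explain and split_plan_sections are hypothetical and not part of Spark or of the talk.

```python
import contextlib
import io
import re


def capture_explain(df):
    """Return the text that df.explain(extended=True) prints.

    Assumption: `df` is a pyspark.sql.DataFrame (or any object with a
    compatible explain method) that prints via Python's stdout.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(extended=True)
    return buf.getvalue()


def split_plan_sections(explain_text):
    """Split explain output into a dict of {section title: plan text}.

    Section headers are assumed to look like "== Analyzed Logical Plan ==".
    """
    sections = {}
    title = None
    for line in explain_text.splitlines():
        header = re.fullmatch(r"== (.+) ==", line.strip())
        if header:
            title = header.group(1)
            sections[title] = []
        elif title is not None:
            sections[title].append(line)
    return {t: "\n".join(lines).strip() for t, lines in sections.items()}
```

With a live session, split_plan_sections(capture_explain(spark.sql("SELECT ..."))) would give one entry per plan, which makes it easier to compare the analyzed and optimized plans side by side.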

Read the analyzed plan to check the implicit type casting.
Tip: Explicitly cast the types in the queries.
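
As a rough aid for this check, the sketch below scans analyzed-plan text (for example, copied from the SQL tab) for casts applied to column references. It assumes casts are rendered as "cast(name#id as type)"; the helper name find_column_casts and the sample plan line in the usage note are illustrative, not captured from a real run.

```python
import re

# Matches e.g. "cast(key#10 as int)" or "cast(id#3L as string)".
# Only the first word of the target type is captured, so
# "decimal(10,2)" is reported as "decimal".
_CAST_PATTERN = re.compile(r"cast\((\w+)#\d+L? as (\w+)", re.IGNORECASE)


def find_column_casts(analyzed_plan_text):
    """Return (column, target_type) pairs for each cast on a column.

    The plan text does not say whether a cast was written by the user or
    inserted by the analyzer; compare the result against the query text.
    """
    return [(m.group(1), m.group(2))
            for m in _CAST_PATTERN.finditer(analyzed_plan_text)]
```

For a plan line such as "Filter (cast(key#10 as int) = 1)", the helper reports the pair ("key", "int"). If the query itself contains no CAST on that column, the cast was added implicitly.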

Create Hive Tables
Syntax to create a Hive Serde table

Filter pushdown
The native reader/writer performs faster than the Hive serde reader/writer.

Create Native Tables
Syntax to create a Spark native ORC table
Tip: Create native data source tables for better performance and stability.
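
The slide's DDL did not survive extraction, so the following is a sketch of the two syntaxes rather than the speaker's example. It assumes a SparkSession is passed in as `spark` (with Hive support enabled for the first statement); the table and column names are hypothetical.

```python
# Hive serde table: STORED AS selects the Hive serde reader/writer.
HIVE_SERDE_ORC_DDL = """
CREATE TABLE sales_hive (id INT, amount DOUBLE)
STORED AS ORC
"""

# Spark native data source table: USING selects Spark's own ORC reader/writer.
NATIVE_ORC_DDL = """
CREATE TABLE sales_native (id INT, amount DOUBLE)
USING ORC
"""


def create_example_tables(spark):
    """Run both DDL statements on the given SparkSession (assumed)."""
    for ddl in (HIVE_SERDE_ORC_DDL, NATIVE_ORC_DDL):
        spark.sql(ddl)
```

The only difference between the two statements is STORED AS versus USING, which is what decides whether the table is read and written through the Hive serde or the native data source.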

Push Down + Implicit Type Casting
Not pushed down???
Tip: Is the cast needed? Can the constants be updated to match the column type instead?
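
To check what was actually pushed down, the scan node in the physical plan can be inspected. The sketch below assumes the plan text contains an entry of the form "PushedFilters: [IsNotNull(id), EqualTo(id,1)]"; the helper name pushed_filters is hypothetical, and filters that themselves contain square brackets are not handled.

```python
import re


def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def pushed_filters(physical_plan_text):
    """Return the filters listed in the first PushedFilters entry, if any."""
    match = re.search(r"PushedFilters: \[(.*?)\]", physical_plan_text)
    if match is None:
        return []
    return _split_top_level(match.group(1))
```

If the comparison from the WHERE clause is missing from the returned list, the predicate is being evaluated after the scan. A cast wrapped around the column is one possible cause, which is where the tip about adjusting the constant applies.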

Nested Schema Pruning
Not pruned???

Collapse Projects
Call UDF three times!!!

Cross-session SQL Cache
• If a query is cached in one session, new queries in all
sessions might be affected.
• Check your query plan!
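
One way to act on this is to look for cache operators in the plan text. The sketch below assumes cached data shows up as InMemoryRelation in logical plans and InMemoryTableScan in physical plans; the helper name plan_uses_cache is hypothetical.

```python
# Operator names assumed to indicate that a plan reads from the SQL cache.
_CACHE_MARKERS = ("InMemoryTableScan", "InMemoryRelation")


def plan_uses_cache(plan_text):
    """Return True if the plan text mentions a cache operator."""
    return any(marker in plan_text for marker in _CACHE_MARKERS)
```

A True result for a query that never asked for caching suggests that another session cached a matching query.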

Join Hints in Spark 3.0
• BROADCAST: Broadcast Hash/Nested-loop Join
• MERGE: Shuffle Sort Merge Join
• SHUFFLE_HASH: Shuffle Hash Join
• SHUFFLE_REPLICATE_NL: Shuffle-and-Replicate Nested Loop Join
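
In SQL, these hints are written as a comment right after SELECT. The sketch below builds such a statement for the four hint names listed above; the function name hinted_join_sql and the table and column names in the usage note are hypothetical.

```python
# The four join strategy hints named on the slide.
JOIN_HINTS = ("BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL")


def hinted_join_sql(hint, hinted_table, left, right, key):
    """Build an equi-join query carrying a join hint for `hinted_table`."""
    hint = hint.upper()
    if hint not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    return (
        f"SELECT /*+ {hint}({hinted_table}) */ * "
        f"FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"
    )
```

For example, hinted_join_sql("broadcast", "dim", "fact", "dim", "id") produces a query that asks Spark to broadcast the table dim. A hint is a request, so the physical plan is still worth checking to see which join was chosen.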

From SQL query to Spark Jobs

• A SQL query => multiple Spark jobs.
  - For example, broadcast exchange, shuffle exchange, scalar subquery.
  - External data sources: Delta Lake.
  - New adaptive query execution.
• A Spark job => a DAG
  - A chain of RDD dependencies organized in a directed acyclic graph (DAG).

[Diagram labels: Optimized Logical Plan, Planner, Physical Plans, Cost Model, Selected Physical Plan, Query Execution, DAGs]
• The higher level: SQL physical operators.
• The low level: Spark RDD primitives.

Job Tab in Spark UI
The amount of time for each job.
Any stage/task failure?

Job Tab
The amount of time for each stage.
• Jobs
• Stages
• Tasks

Stages Tab
• How is the time spent?
• Any outliers in task execution?
• Straggler tasks?
• Skew in data size or compute time?
• Too many/few tasks (partitions)?
• Load balanced? Locality?
Task-specific info
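
The straggler question can be made concrete with a simple rule of thumb. The sketch below flags tasks whose duration exceeds a multiple of the median; the function name, the factor of 2.0, and the sample durations in the usage note are illustrative assumptions, not thresholds from the talk or from Spark.

```python
from statistics import median


def straggler_tasks(durations_ms, factor=2.0):
    """Return indexes of tasks slower than `factor` times the median.

    `durations_ms` is a list of task durations, for example copied from
    the task table of a stage.
    """
    if not durations_ms:
        return []
    typical = median(durations_ms)
    if typical <= 0:
        return []
    return [i for i, d in enumerate(durations_ms) if d > factor * typical]
```

For durations of 100, 110, 90, 105 and 900 ms the median is 105 ms, so only the last task (index 4) is flagged. The same comparison can be applied to per-task input sizes to look for data skew.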

Balanced? Skew?
Killed?
Which executor’s log should we read?

Executors Tab
• Size of data transferred between stages
• Used/available memory
• Are all the problematic executors on the same node?

- Interacting with Hive metastore?
- Slow query planning?
- Slow file listing?

Insert into Partitioned Hive Table OR “STORED AS PARQUET”
5000 partitions took almost 8 minutes!!!

Insert into Partitioned Native Table
Reduced from almost 8 minutes to less than 1 minute!!!

Insert into Partitioned Delta Table
Reduced from almost 8 minutes to 27 seconds!!!

Typical Spark Performance Issues
The table has thousands of partitions
• Hive metastore overhead
The table can have 100s of thousands to millions of files
• File system overhead: listing takes forever!
New data is not immediately visible
• Need to run a “REFRESH TABLE” command in the SQL engine being used
The above issues can add 10s of minutes to the response time!

Delta Lake + Spark
Scalable metadata handling @ Delta Lake
Store metadata in transaction log file instead of metastore
The table has thousands of partitions
• Zero Hive Metastore overhead
The table can have 100s of thousands to millions of files
• No file listing
New data is not immediately visible
• Delta table state is computed on read

How do I use Delta?
format("parquet") -> format("delta")
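
In DataFrame code, the switch is a one-word change in the writer and reader. The sketch below is a minimal illustration, assuming a PySpark DataFrame and SparkSession with the Delta Lake package available; the function names and the path in the usage note are hypothetical.

```python
def write_table(df, path, fmt="delta"):
    """Write `df` to `path`; pass fmt="parquet" for the old behaviour."""
    df.write.format(fmt).save(path)


def read_table(spark, path, fmt="delta"):
    """Read the table at `path` back as a DataFrame."""
    return spark.read.format(fmt).load(path)
```

Calling write_table(df, "/tmp/events") followed by read_table(spark, "/tmp/events") would round-trip the data through Delta. Nothing else in the calling code changes compared with the Parquet version.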

Delta Lake + Spark
• Full ACID transactions
• Schema management
• Data versioning and time travel
• Unified batch/streaming support
• Scalable metadata handling
• Record update and deletion
• Data expectation
Delta Lake: https://delta.io/
For details, refer to the blog
https://tinyurl.com/yxhbe2lg

Delta Usage Statistics
More than 1 exabyte (10^18 bytes) processed monthly
[Chart by industry: Manufacturing; Public Sector; Technology; Healthcare and Life Sciences; Financial Services; Media and Entertainment; Retail, CPG, and eCommerce; Other]

Additional Resources
• Apache Spark documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html
• Blog: https://databricks.com/blog/category/engineering/spark
• Previous summit: https://databricks.com/sparkaisummit/north-america/sessions
• Delta Lake documentation: https://docs.delta.io
• Databricks documentation: https://docs.databricks.com/
• Books: https://www.amazon.com/s?k=apache+spark
• Databricks academy: https://academy.databricks.com
• Databricks ebooks: https://databricks.com/resources/type/ebooks

Thank you
Xiao Li
(lixiao@databricks.com)
