Deep Dive into Spark SQL with Advanced Performance Tuning

Takuya UESHIN
Hadoop / Spark Conference 2019, Mar 2019

About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin

Databricks Unified Analytics Platform
- Databricks Workspace: Notebooks, Jobs, Models, APIs, Dashboards (end-to-end ML lifecycle)
- Databricks Runtime: Databricks Delta, ML Frameworks
- Databricks Cloud Service: reliable & scalable, simple & integrated

Spark SQL
A highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance.

Run Everywhere
Integrate, analyze, and process data from a variety of data sources (Cassandra, Kafka, RDBMS, etc.) and file formats (Parquet, ORC, CSV, JSON, etc.).

The not-so-secret truth...
Spark SQL is not only SQL.

Not Only SQL
Spark applications and libraries are also built on Spark SQL:
• Structured Streaming: stream processing
• MLlib: machine learning
• GraphFrame: graph computation
• Your own Spark applications: using the SQL, DataFrame, and Dataset APIs

Lazy Evaluation
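Optimization and execution are deferred as long as possible.
Spark SQL optimizes across functions and libraries: the whole query is optimized as a unit, whether it was built through libraries or through the SQL/DataFrame/Dataset APIs.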

Spark SQL
A compiler from queries to RDDs.

Performance Tuning for Optimal Plans
Run EXPLAIN Plan.
Interpret Plan.
Tune Plan.

- Explain command/APIs
- SQL tab of the Spark UI / Spark History Server

Declarative APIs
What do you want to do?
• SQL API: ANSI SQL:2003 / HiveQL.
• Dataset/DataFrame APIs: richer, user-friendly, language-integrated interfaces

Declarative APIs
Differences between SQL, DataFrame, and Dataset
• The DataFrame API is untyped relational processing
• The Dataset API is the typed version, but some operations carry a performance penalty [SPARK-14083]
• http://dbricks.co/29xYnqR

Metadata Catalog
• Persistent Hive metastore [Hive 0.12 - Hive 2.3.3]
• Session-local temporary view manager
• Cross-session global temporary view manager
• Session-local function registry

Metadata Catalog
Session-local function registry
• Easy-to-use lambda UDF
• PySpark Python UDF / Pandas UDF
• Native UDAF interface
• Hive UDF, UDAF, UDTF support
• About 300 built-in SQL functions
• including 30+ higher-order built-in functions
• Blog for higher-order functions: https://dbricks.co/2rR8vAr

Performance Tips - Catalog
Cost of fetching partition metadata:
- Upgrade the Hive metastore
- Avoid high-cardinality partition columns
- Use partition pruning predicates (improved in [SPARK-20331])

Cache Manager
• Replaces a query plan with cached data when the plans match
• Cross-session
• Cached data is built the first time it is used
• Cached data is invalidated when the data of a table/view it depends on is updated
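
The behaviors listed above (plan matching, build on first use, invalidation on update) can be sketched with a small toy model in plain Python. This is an illustration under stated assumptions, not Spark's CacheManager: the ToyCacheManager class, the depends_on helper, and the nested-tuple plan representation are all hypothetical.

```python
def depends_on(plan, table):
    """A plan is a nested tuple; the last element of a non-scan node is its child."""
    if plan[0] == "scan":
        return plan[1] == table
    return depends_on(plan[-1], table)


class ToyCacheManager:
    def __init__(self):
        self._entries = {}  # plan -> cached rows, or None if not built yet

    def cache(self, plan):
        # Marking a plan as cached does not compute anything.
        self._entries.setdefault(plan, None)

    def execute(self, plan, compute):
        if plan not in self._entries:
            return compute()  # no matching cached plan: run normally
        if self._entries[plan] is None:
            self._entries[plan] = compute()  # built on first use
        return self._entries[plan]

    def table_updated(self, table):
        # Drop cached data for every plan that reads the updated table.
        for plan in self._entries:
            if depends_on(plan, table):
                self._entries[plan] = None


runs = []  # one entry per real computation


def compute():
    runs.append(1)
    return [2, 3]


manager = ToyCacheManager()
plan = ("filter", "value > 1", ("scan", "events"))
manager.cache(plan)
runs_after_cache = len(runs)

manager.execute(plan, compute)
# A structurally equal plan built elsewhere matches the cached one.
manager.execute(("filter", "value > 1", ("scan", "events")), compute)
runs_after_reuse = len(runs)

manager.table_updated("events")
manager.execute(plan, compute)
runs_after_update = len(runs)
```

In this model the cache key is the plan itself, which is why an identical query issued from a different place can reuse the cached data.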

Performance Tips - Cache
Cache: not always faster
- Cached data may be spilled to disk
- Do not cache unless you need to

Optimizer
Rewrites plans using heuristics and cost-based rules
• Outer join elimination
• Constraint propagation
• Join reordering
• Column pruning
• Predicate push down
• Constant folding
and many more.
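
To make one of these rewrites concrete, here is a minimal sketch of constant folding over a tiny expression tree in plain Python. It is not Catalyst code: the Lit, Col, and Add node types and the fold_constants function are hypothetical names chosen for illustration, and only integer addition is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite bottom-up: an Add of two literals becomes a single literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr  # literals and column references are left unchanged


# x + (1 + 2) is rewritten to x + 3 once, instead of adding 1 + 2 per row.
expr = Add(Col("x"), Add(Lit(1), Lit(2)))
folded = fold_constants(expr)
```

The rule is a pure function from one tree to an equivalent tree, which is the general shape of a heuristic rewrite: it never changes the result, only the work needed to compute it.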

Performance Tips - Optimizer
Plug in your own Optimizer or Planner rules
• SparkSessionExtensions
• ExperimentalMethods class
• var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
• var extraStrategies: Seq[Strategy] = Nil
• Examples in Herman's talk "Deep Dive into Catalyst Optimizer"
• Join two intervals: http://dbricks.co/2etjIDY

Planner
• Turns a Logical Plan into a Physical Plan (what to how)
• Selects the best Physical Plan based on cost
Example: a Join of table1 and table2 can be planned as a broadcast hash join OR a sort merge join. The broadcast join has lower cost if one table can fit in memory.

Performance Tips - Join Selection
broadcast join: one of the two tables is broadcast, and the join result is produced without shuffling the other table.
shuffle join: both table 1 and table 2 are shuffled before the join result is produced.
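
The difference in data movement between the two strategies can be sketched in plain Python. This is a simplified model, not Spark's execution code: broadcast_join and shuffle_join are hypothetical functions, partitions are plain lists, and Python's built-in hash stands in for the partitioning function.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_rows):
    # The small table is copied to every partition as a hash map;
    # rows of the big table never leave their partition.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return [
        [(k, v, sv) for k, v in part for sv in lookup.get(k, [])]
        for part in big_partitions
    ]


def shuffle_join(left_rows, right_rows, num_partitions):
    # Both tables are repartitioned by key so that matching keys
    # meet in the same partition; every row may move.
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for k, v in left_rows:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, v in right_rows:
        right_parts[hash(k) % num_partitions].append((k, v))
    return [
        [(k, v, rv) for k, v in lp for rk, rv in rp if k == rk]
        for lp, rp in zip(left_parts, right_parts)
    ]


big = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]  # already split into 2 partitions
small = [(1, "x"), (3, "y")]

via_broadcast = sorted(row for part in broadcast_join(big, small) for row in part)
all_big_rows = [row for part in big for row in part]
via_shuffle = sorted(row for part in shuffle_join(all_big_rows, small, 2) for row in part)
```

Both strategies return the same rows; what differs is that the broadcast version moves only the small table, while the shuffle version redistributes both inputs.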

Performance Tips - Join Selection
broadcast join vs shuffle join (broadcast is faster)
• spark.sql.autoBroadcastJoinThreshold
• Keep statistics up to date
• broadcastJoin hint
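
Why the threshold and fresh statistics matter together can be shown with a deliberately simplified decision function in plain Python. The choose_join function is hypothetical and ignores join types, hints, and other factors a real planner considers; the 10 MB default is the documented default of spark.sql.autoBroadcastJoinThreshold, and a negative value is assumed here to disable broadcasting, matching the documented meaning of -1.

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD_BYTES = 10 * MB  # documented default of spark.sql.autoBroadcastJoinThreshold


def choose_join(estimated_left_bytes, estimated_right_bytes,
                threshold=DEFAULT_THRESHOLD_BYTES):
    """Simplified rule: broadcast the smaller side if its estimate fits the threshold."""
    smaller = min(estimated_left_bytes, estimated_right_bytes)
    if threshold >= 0 and smaller <= threshold:
        return "broadcast hash join"
    return "sort merge join"


# Accurate statistics: the small table is estimated at 5 MB.
with_fresh_stats = choose_join(50_000 * MB, 5 * MB)

# Stale statistics: the same table is still estimated at 500 MB,
# so the cheaper plan is missed even though the table would fit.
with_stale_stats = choose_join(50_000 * MB, 500 * MB)

# Raising the threshold changes the decision for the same estimate.
with_raised_threshold = choose_join(50_000 * MB, 500 * MB, threshold=512 * MB)

# A negative threshold turns automatic broadcasting off.
with_broadcast_disabled = choose_join(50_000 * MB, 5 * MB, threshold=-1)
```

The decision is made on estimated sizes, not actual ones, which is why out-of-date statistics can silently push a query onto the slower plan.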

Performance Tips - Equal Join
… t1 JOIN t2 ON t1.id = t2.id AND t1.value < t2.value
… t1 JOIN t2 ON t1.value < t2.value
Include at least one equality condition in the join condition

Performance Tips - Equal Join
… t1 JOIN t2 ON t1.id = t2.id AND t1.value < t2.value: O(n)
… t1 JOIN t2 ON t1.value < t2.value: O(n ^ 2)
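
The gap between the two complexities can be made visible by counting comparisons in a small plain-Python sketch. The two functions are hypothetical stand-ins for the two join conditions, not Spark operators, and the O(n) figure assumes that each id matches only a few rows (here exactly one).

```python
from collections import defaultdict


def nested_loop_join(t1, t2):
    # ON t1.value < t2.value: without an equality condition,
    # every pair of rows has to be compared.
    comparisons = 0
    out = []
    for id1, v1 in t1:
        for id2, v2 in t2:
            comparisons += 1
            if v1 < v2:
                out.append((id1, v1, id2, v2))
    return out, comparisons


def hash_equi_join(t1, t2):
    # ON t1.id = t2.id AND t1.value < t2.value: the equality condition lets us
    # hash t2 by id and compare values only among rows with the same id.
    by_id = defaultdict(list)
    for id2, v2 in t2:
        by_id[id2].append(v2)
    comparisons = 0
    out = []
    for id1, v1 in t1:
        for v2 in by_id.get(id1, []):
            comparisons += 1
            if v1 < v2:
                out.append((id1, v1, id1, v2))
    return out, comparisons


n = 200
t1 = [(i, i) for i in range(n)]
t2 = [(i, i + 1) for i in range(n)]

_, nested_comparisons = nested_loop_join(t1, t2)  # grows as n * n
equi_rows, equi_comparisons = hash_equi_join(t1, t2)  # grows as n for unique ids
```

The two queries are not equivalent; the point is only that the equality condition gives the engine a key to hash or sort on, so it no longer has to examine every pair.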

Query Execution
• Memory Manager: tracks memory usage and arbitrates memory between tasks and operators
• Code Generator: compiles the Physical Plan into Java code
• Tungsten Engine: binary data formats and data structures that are efficient for CPU and memory

Performance Tips - Memory Manager
Set spark.executor.memory and spark.memory.fraction with headroom for memory that Spark cannot track (for example, netty buffers and Parquet writer buffers).
To enable off-heap memory, set spark.memory.offHeap.enabled and spark.memory.offHeap.size, and reduce spark.executor.memory accordingly.
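
A rough back-of-the-envelope calculation helps when picking these values. The sketch below is plain Python arithmetic, not a Spark API: memory_budget is a hypothetical helper, and it assumes the unified memory model described in Spark's tuning guide, where 300 MB of the heap is reserved and spark.memory.fraction (default 0.6) of the remainder is managed by Spark. The sizes used are example values only.

```python
RESERVED_MB = 300  # reserved heap in Spark's unified memory model


def memory_budget(executor_memory_mb, memory_fraction=0.6, off_heap_mb=0):
    """Split an executor's memory into Spark-managed and unmanaged parts (in MB)."""
    managed_on_heap = (executor_memory_mb - RESERVED_MB) * memory_fraction
    return {
        "managed_on_heap_mb": managed_on_heap,
        # Heap left for user objects and anything Spark does not track.
        "unmanaged_on_heap_mb": executor_memory_mb - managed_on_heap,
        "off_heap_mb": off_heap_mb,
        "total_mb": executor_memory_mb + off_heap_mb,
    }


# All memory on the heap: spark.executor.memory = 8192 MB.
on_heap_only = memory_budget(8192)

# Same total, but 2048 MB moved off-heap and spark.executor.memory lowered to match.
with_off_heap = memory_budget(6144, off_heap_mb=2048)
```

Keeping the total constant while shifting part of it off-heap is one way to read the advice to reduce spark.executor.memory when off-heap memory is enabled.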

Performance Tip - WholeStage codegen
Set spark.sql.codegen.hugeMethodLimit
The JIT compiler cannot compile large methods whose bytecode size exceeds 8 KB

Data Sources
• Separation of computation and storage
• Complete data pipeline:
• External storage feeds the data
• Spark processes it
• When Spark's data processing is very fast, the data source can become the bottleneck

Scan Vectorization
• Vectorization enables more efficient columnar data reads
• Makes it easier for the JVM to use SIMD
• ……

Partitioning and Bucketing
• File layouts for data skipping and pre-shuffle
• Speed up by avoiding unnecessary IO and shuffles
• The summit talk: http://dbricks.co/2oG6ZBL
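
Both ideas can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: the file paths are made up, prune_partitions, bucket_id, and bucketize are hypothetical helpers, and Python's built-in hash stands in for whatever hash function a real engine uses for bucketing.

```python
def prune_partitions(paths, column, value):
    """Data skipping: keep only files under a directory named column=value."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


def bucket_id(key, num_buckets):
    # Stand-in hash; the same function must be used for both tables.
    return hash(key) % num_buckets


def bucketize(rows, num_buckets):
    """Pre-shuffle: place each row in the bucket chosen by its key at write time."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


# Hypothetical partitioned layout: one directory per date.
paths = [
    "/data/events/date=2019-03-13/part-0.parquet",
    "/data/events/date=2019-03-14/part-0.parquet",
    "/data/events/date=2019-03-14/part-1.parquet",
]
selected = prune_partitions(paths, "date", "2019-03-14")

# Two tables bucketed by the same key into the same number of buckets.
orders = bucketize([(1, "o1"), (2, "o2"), (5, "o3")], 4)
users = bucketize([(1, "alice"), (5, "bob"), (6, "carol")], 4)

# Bucket i of one table only needs bucket i of the other: no shuffle at read time.
joined = [
    (key, order, user)
    for order_bucket, user_bucket in zip(orders, users)
    for key, order in order_bucket
    for user_key, user in user_bucket
    if key == user_key
]
```

Partitioning pays off when queries filter on the partition column, while bucketing pays off when the same key is joined or aggregated repeatedly.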

Performance Tips - DataSource
• Use file formats that support vectorized reading (Parquet, ORC)
• For file-based data sources, consider Partitioning or Bucketing

TRACKS
• Apache Spark™: Use Cases, Research, Technical Deep Dives
• AI: Productionizing ML, Deep Learning, Cloud Hardware
• Fields: Data Science, Data Engineering, Enterprise
5000+ ATTENDEES
• Practitioners: Data Scientists, Data Engineers, Analysts, Architects
• Leaders: Engineering Management, VPs, Heads of Analytics & Data, CxOs
databricks.com/sparkaisummit

Nike: Enabling Data Scientists to bring their Models to Market
Facebook: Vectorized Query Execution in Apache Spark at Facebook
Tencent: Large-scale Malicious Domain Detection with Spark AI
IBM: In-memory storage Evolution in Apache Spark
Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution
Apple: Making Nested Columns as First Citizen in Apache Spark SQL
EBay: Managing Apache Spark workload and automatic optimizing.
Google: Validating Spark ML Jobs
HP: Apache Spark for Cyber Security in big company
Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving
ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
Facebook: an efficient Facebook-scale shuffle service
IBM: Make your PySpark Data Fly with Arrow!
Facebook: Distributed Scheduling Framework for Apache Spark
Zynga: Automating Predictive Modeling at Zynga with PySpark
World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator
JD.com: Optimizing Performance and Computing Resource.
Microsoft: Azure Databricks with R: Deep Dive
ICL: Cooperative Task Execution for Apache Spark
Airbnb: Apache Spark at Airbnb
Netflix: Migrating to Apache Spark at Netflix
Microsoft: Infrastructure for Deep Learning in Apache Spark
Intel: Game playing using AI on Apache Spark
Facebook: Scaling Apache Spark @ Facebook
Lyft: Scaling Apache Spark on K8S at Lyft
Uber: Using Spark Mllib Models in a Production Training and Serving Platform
Apple: Bridging the gap between Datasets and DataFrames
Salesforce: The Rule of 10,000 Spark Jobs
Target: Lessons in Linear Algebra at Scale with Apache Spark
Workday: Lesson Learned Using Apache Spark

Thank you
Takuya UESHIN (ueshin@databricks.com)