In recent years Apache Spark has received a lot of hype in the Big Data community. It is seen as a silver bullet for all problems related to gathering, processing and analysing massive datasets. Due to its rapid evolution (do not forget that Spark is one of the most active open source projects), some of the ideas behind it seem to be unclear and require digging into different blog posts and presentations. During this talk we will dive into the internals of Spark SQL, look at how our queries are translated into the actual code executed on the nodes, and find different ways to debug and optimize them.
1. Spark SQL under the hood
Mikołaj Kromka, VirtusLab
mkromka@virtuslab.com
DataKRK meetup
Kraków, 06.09.2017
2. Bio
● Software engineer at VirtusLab and Spark trainer at Virtusity
● Focused mostly on the Scala ecosystem
● Currently developing a new Analytics Platform for Tesco
3. Brief (and selective) history of structuring data
● Codd's relational model (1969 - 50th anniversary in two years!)
● SQL
○ one of the first commercial implementations at IBM (early 1970s)
○ SQL-based RDBMS developed at Relational Software, Inc (now Oracle Corporation) in the late 1970s
● Apache Hive bringing SQL-like capabilities to the Big Data world (open sourced 2008)
● Shark (Hive on Spark, the predecessor of Spark SQL)
● Spark SQL (2014)
4. Apache Spark: why the fuss?
● General engine for large-scale data processing
● Resilient Distributed Datasets
● Automatically generated graph (DAG) of computations
● Scala, Java, Python and R APIs
● A lot of libraries on top of it (SQL, ML, GraphX, Streaming)
● One of the most active open source projects
Source: https://spark.apache.org/docs/latest/cluster-overview.html
6. Do we need anything else?
YES
● Data is usually structured - but RDDs contain arbitrary Java/Python objects, and transformations of RDDs contain arbitrary code
● Analysts know SQL/Hive
● Large SQL/HiveQL codebases that we would like to reuse
● Connecting to different data sources with (semi-)structured datasets
● Applying advanced and complex algorithms (such as ML)
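This is exactly what the DataFrame/SQL API gives us. A minimal sketch of the idea, assuming a local session and an illustrative JSON file and column name (neither comes from the slides): a DataFrameReader loads the semi-structured data, and analysts can query it with plain SQL.

import org.apache.spark.sql.SparkSession

// Illustrative sketch: the path /data/events.json and the event_type column
// are made up for this example.
val spark = SparkSession.builder()
  .appName("structured-data-sketch")
  .master("local[*]")
  .getOrCreate()

// DataFrameReader infers a schema from the semi-structured JSON input
val events = spark.read.json("/data/events.json")

// Register a view so the dataset can be queried with plain SQL
events.createOrReplaceTempView("events")

val counts = spark.sql(
  """SELECT event_type, count(*) AS cnt
    |FROM events
    |GROUP BY event_type""".stripMargin)

counts.show()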
15. Code generation
● Why do we need it?
○ without it, simple expressions such as (x + y) + 1 would be interpreted from scratch for every row in the dataset
● Newer versions of Spark SQL support Whole-Stage Code Generation (not only expression-level code generation)
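To see what the generated code actually looks like, Spark ships debugging helpers in org.apache.spark.sql.execution.debug. A minimal sketch, assuming a local session and made-up column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

// Illustrative sketch: session setup and the x, y columns are made up.
val spark = SparkSession.builder()
  .appName("codegen-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("x", "y")
  .selectExpr("(x + y) + 1 AS result")

// Prints the Java source produced by whole-stage code generation for each
// WholeStageCodegen subtree of the physical plan
df.debugCodegen()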
18. Some advice
● Don't stick to the Dataset API blindly - typed operations that take arbitrary lambdas cannot be inlined during codegen and will be slower
● Don't assume Spark SQL has all the features of a traditional RDBMS; if you don't handle large amounts of data, Postgres will be enough
● If possible, don't create DataFrames from RDDs with the .toDF() method; use a specific DataFrameReader instead
● Analyse the plans generated by Catalyst to see whether any optimizations were missed or there is room for improvement (see the sketch after this list)
● The Spark UI is always useful
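A minimal sketch tying the last points together, with an illustrative CSV path and columns (not from the slides): read through a DataFrameReader rather than .toDF(), then print the plans Catalyst produced.

import org.apache.spark.sql.SparkSession

// Illustrative sketch: /data/users.csv and the id/age columns are made up.
val spark = SparkSession.builder()
  .appName("plan-inspection-sketch")
  .master("local[*]")
  .getOrCreate()

// A DataFrameReader carries the schema and source-specific options,
// unlike building a DataFrame from an RDD with .toDF()
val users = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/users.csv")

val adults = users.filter("age >= 18").select("id", "age")

// Prints the parsed, analyzed and optimized logical plans plus the physical
// plan, so you can check for missed optimizations (e.g. filters not pushed
// down, unexpected shuffles)
adults.explain(extended = true)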