Spark SQL is a component of Apache Spark that introduces SQL support. It includes a DataFrame API that lets users write SQL-like queries against Spark, a Catalyst optimizer that converts logical plans into optimized physical plans, and data source APIs that provide a unified way to read and write data in various formats. Spark SQL aims to make SQL on Spark both efficient and extensible.
2. OUTLINE
➤ Background
➤ Spark SQL overview
➤ Spark components
➤ Data Source API
➤ DataFrame API
➤ Catalyst optimizer
➤ Spark SQL future
➤ Conclusion
➤ References
3. WHY BIG DATA WITH SQL
➤ SQL is compatible with tooling
➤ e.g. Connect to existing BI tools via JDBC/ODBC
➤ Large pool of engineers proficient in SQL
➤ Compared with MapReduce, SQL is more expressive/succinct
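The expressiveness gap can be sketched with a toy word count: SQL states the result in one line, while a hand-written MapReduce-style job spells out every phase. This is plain single-node Python for illustration, not an actual Hadoop job.

```python
from collections import defaultdict

# The SQL version is one line:
#   SELECT word, COUNT(*) FROM words GROUP BY word
# The equivalent hand-written MapReduce-style job is far more verbose.

def map_phase(lines):
    """Map: emit (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark makes sql fast", "sql on spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```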
5. CHALLENGES
➤ Spark needs to perform ETL on various data source formats
➤ Solution: Data source APIs
➤ Implement SQL for Spark
➤ In an extensible way
➤ Solution: DataFrame API
➤ In an efficient way
➤ Solution: Catalyst optimizer
Spark SQL
Major Components
9. DATA SOURCE API
➤ Unified interface to reading/writing data in a variety of formats
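The unified-interface idea can be sketched in plain Python: each format registers a parse function, and callers use one fluent reader regardless of format. The class and function names here are hypothetical, not Spark's actual implementation.

```python
import csv
import io
import json

# Hypothetical registry of format parsers: adding a new data source
# means registering one entry, and callers never change.
FORMATS = {
    "json": lambda text: [json.loads(line) for line in text.splitlines()],
    "csv":  lambda text: list(csv.DictReader(io.StringIO(text))),
}

class Reader:
    """One fluent read interface, whatever the underlying format."""
    def __init__(self):
        self._format = None

    def format(self, name):
        self._format = name
        return self

    def load(self, text):
        return FORMATS[self._format](text)

# Same call shape for two different formats:
rows_csv = Reader().format("csv").load("a,b\n1,2\n")
rows_json = Reader().format("json").load('{"a": "1", "b": "2"}')
```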
13. SQL ON HADOOP
➤ Hive and Pig pipeline
➤ Problems
➤ The systems are quite similar, yet each implements its own optimizer/executor
➤ No common data structure is shared across Hive, Pig, and other DSLs
14. DATAFRAME API
➤ DataFrame as a front end
➤ Multiple DSL share same optimizer and executor
➤ Plug your own DSL too
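The shared-backend idea above can be sketched as a toy: two different front-end DSLs (a SQL-like string and a method-chaining API) build the same logical plan, and one executor runs both. All names are hypothetical; real Spark plans are far richer.

```python
# Logical plan nodes as plain tuples.
def scan(rows):        return ("scan", rows)
def select(plan, col): return ("select", col, plan)

# Front end 1: a tiny SQL-like parser ("SELECT col FROM data").
def parse_sql(query, rows):
    col = query.split()[1]
    return select(scan(rows), col)

# Front end 2: a method-chaining DataFrame-style API.
class DF:
    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def from_rows(rows):
        return DF(scan(rows))

    def select(self, col):
        return DF(select(self.plan, col))

# One executor shared by every front end.
def execute(plan):
    if plan[0] == "scan":
        return plan[1]
    if plan[0] == "select":
        _, col, child = plan
        return [{col: row[col]} for row in execute(child)]

rows = [{"name": "ada", "age": 36}, {"name": "alan", "age": 41}]
plan_sql = parse_sql("SELECT name FROM data", rows)
plan_df = DF.from_rows(rows).select("name").plan
assert plan_sql == plan_df  # both DSLs produce the identical logical plan
```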
15. MAJOR MILESTONES IN SPARK SQL
➤ Spark 1.0: SchemaRDD; Spark SQL becomes part of the Spark core distribution
➤ Spark 1.3: SchemaRDD renamed to DataFrame API due to the OO structure change
➤ Spark 1.6: Datasets add compile-time type safety and enable further optimization
17. DATAFRAME API
➤ Idea borrowed from Python's pandas library
➤ single-node tabular data with APIs for math, statistics, algebra, …
➤ Definition:
➤ RDD + Schema
➤ RDDs with additional relational operators such as
➤ selecting required columns
➤ joining different data sources
➤ aggregation
➤ filtering
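The "RDD + Schema with relational operators" definition can be illustrated with a minimal, hypothetical single-node stand-in (real DataFrames are distributed). The class below carries rows plus a schema and supports the four operators listed above.

```python
class MiniFrame:
    """Toy single-node DataFrame: data + schema, like RDD + Schema."""
    def __init__(self, rows, schema):
        self.rows, self.schema = rows, schema

    def select(self, *cols):
        """Keep only the required columns."""
        return MiniFrame([{c: r[c] for c in cols} for r in self.rows], list(cols))

    def filter(self, pred):
        """Keep rows matching a predicate."""
        return MiniFrame([r for r in self.rows if pred(r)], self.schema)

    def join(self, other, key):
        """Inner join with another frame on a shared key column."""
        rows = [{**l, **r} for l in self.rows for r in other.rows
                if l[key] == r[key]]
        schema = self.schema + [c for c in other.schema if c != key]
        return MiniFrame(rows, schema)

    def agg_sum(self, col):
        """Aggregate: sum one column."""
        return sum(r[col] for r in self.rows)

users = MiniFrame([{"id": 1, "name": "ada"}, {"id": 2, "name": "alan"}],
                  ["id", "name"])
orders = MiniFrame([{"id": 1, "amount": 10}, {"id": 1, "amount": 5}],
                   ["id", "amount"])
total = users.filter(lambda r: r["name"] == "ada").join(orders, "id").agg_sum("amount")
```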
20. CATALYST OPTIMIZER
➤ Goal: convert the logical plan into an optimized physical plan
➤ Process:
➤ The logical plan is a tree representing data and schema
➤ The tree is manipulated and optimized by Catalyst rules
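The rule-based tree rewriting above can be sketched with one toy rule: constant folding on an expression tree. This is a hedged illustration of the approach, not Catalyst's actual code (which pattern-matches Scala case classes).

```python
def fold_constants(node):
    """One optimizer rule, applied bottom-up: Add(Lit a, Lit b) -> Lit(a+b)."""
    if isinstance(node, tuple) and node[0] == "add":
        left = fold_constants(node[1])
        right = fold_constants(node[2])
        if left[0] == "lit" and right[0] == "lit":
            return ("lit", left[1] + right[1])  # rewrite the subtree
        return ("add", left, right)
    return node  # leaves (columns, literals) pass through unchanged

# x + (1 + 2) becomes x + 3: the constant subtree is folded away.
expr = ("add", ("col", "x"), ("add", ("lit", 1), ("lit", 2)))
optimized = fold_constants(expr)
```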
26. SPARK SQL FUTURE
➤ Tungsten - Optimization for the next few years
➤ Begin with hardware trends
27. SPARK SQL FUTURE
➤ Tungsten - Optimization for the next few years
➤ Recall our Spark SQL workflow
29. SPARK SQL FUTURE
➤ Tungsten - Preparing Spark for next 5 years
➤ Substantially speed up execution by optimizing CPU efficiency via
➤ Runtime code generation
➤ Exploiting cache locality
➤ Off-heap memory management
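The runtime code-generation idea can be illustrated with a toy: instead of interpreting an expression tree once per row, compile it once into straight-line code and call that in a tight loop. All names here are hypothetical, and real Tungsten generates JVM bytecode, not Python.

```python
def compile_expr(expr):
    """Turn an expression tree into one generated function (compiled once)."""
    def to_src(node):
        if node[0] == "col":
            return f"row[{node[1]!r}]"
        if node[0] == "lit":
            return repr(node[1])
        if node[0] == "add":
            return f"({to_src(node[1])} + {to_src(node[2])})"
    # Generate source for the whole expression, then compile it once.
    return eval(f"lambda row: {to_src(expr)}")

# x + 3, compiled once, then applied per row with no tree walking.
expr = ("add", ("col", "x"), ("lit", 3))
fn = compile_expr(expr)
results = [fn(row) for row in [{"x": 1}, {"x": 2}]]
```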
32. MY LEARNING WORKFLOW
➤ Recently I have gradually set up a workflow that shortens my learning cycle
➤ Everyone's learning workflow is different
➤ but let's share and make progress together
The workflow has three stages:
➤ Knowledge collecting: Evernote web clipper; organize notes by tags
➤ Summarization: MindManager; Evernote markdown documents with Marxico
➤ Sharing: blog (Evernote + postach.io); Keynote/PowerPoint