SPARK SQL
Shijie Zhang
OUTLINE
➤ Background
➤ Spark SQL overview
➤ Spark components
➤ Data Source API
➤ DataFrame API
➤ Catalyst optimizer
➤ SparkSQL future
➤ Conclusion
➤ References
WHY BIG DATA WITH SQL
➤ SQL is compatible with tooling
➤ e.g. Connect to existing BI tools via JDBC/ODBC
➤ Large pool of engineers proficient in SQL
➤ Compared with MapReduce, SQL is more expressive/succinct
“Spark SQL is more than SQL”
- not so top secret
CHALLENGES
➤ Spark needs to perform ETL on various data source formats
➤ Solution: Data source APIs
➤ Implement SQL for Spark
➤ In an extensible way
➤ Solution: DataFrame API
➤ In an efficient way
➤ Solution: Catalyst optimizer
Spark SQL
Major Components
SPARK SQL OVERALL ARCHITECTURE
SPARK SQL WORKFLOW
DATA SOURCE API
➤ Read and write with a variety of formats
DATA SOURCE API
➤ Unified interface to reading/writing data in a variety of formats
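To make "unified interface" concrete: the caller names a format, and the same builder-style read call works for every registered format. The sketch below mimics that idea in plain Python with invented names (`Reader`, `format`, `load`); it is not Spark's actual Data Source API, which reads from files rather than strings.

```python
import csv
import io
import json

class Reader:
    """Minimal stand-in for a format-agnostic reader (not Spark's API)."""

    def __init__(self):
        self._fmt = None

    def format(self, name):
        self._fmt = name
        return self  # allow chaining, mirroring the builder style

    def load(self, text):
        # Dispatch on the requested format; every branch returns the
        # same shape, a list of dict rows, so downstream code is
        # independent of the source format.
        if self._fmt == "json":
            return [json.loads(line) for line in text.splitlines() if line]
        if self._fmt == "csv":
            return list(csv.DictReader(io.StringIO(text)))
        raise ValueError(f"unknown format: {self._fmt}")

rows_json = Reader().format("json").load('{"name": "a", "age": 1}\n')
rows_csv = Reader().format("csv").load("name,age\na,1\n")
```

The point is the shape of the API, not the parsing: adding a new format only means adding a branch (or, in Spark, registering a data source), while every caller keeps the same `format(...).load(...)` call.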
SQL ON HADOOP
➤ Hive and Pig pipeline
➤ Problems
➤ The pipelines are quite similar, yet each implements its own optimizer/executor
➤ No common data structure shared across Hive, Pig, and other DSLs
DATAFRAME API
➤ DataFrame as a front end
➤ Multiple DSL share same optimizer and executor
➤ Plug your own DSL too
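The "multiple DSLs, one optimizer" idea can be sketched as two front ends lowering to the same logical-plan tree. Everything here (the tuple encoding, `plan_from_sql`) is invented for illustration; the takeaway is only that both DSLs produce an identical plan, so one optimizer/executor serves both.

```python
# Logical-plan nodes as plain nested tuples.
def scan(table):
    return ("Scan", table)

def filter_(plan, predicate):
    return ("Filter", predicate, plan)

# Front end 1: a method-chain (DataFrame-like) DSL.
chain_plan = filter_(scan("people"), "age > 21")

# Front end 2: a toy SQL parser handling one fixed query shape.
def plan_from_sql(sql):
    # e.g. "SELECT * FROM people WHERE age > 21"
    _, rest = sql.split(" FROM ", 1)
    table, predicate = rest.split(" WHERE ", 1)
    return filter_(scan(table.strip()), predicate.strip())

sql_plan = plan_from_sql("SELECT * FROM people WHERE age > 21")
# Both front ends produce the identical plan tree, so any shared
# optimizer rule applies to queries written in either DSL.
```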
MAJOR MILESTONE IN SPARK SQL
SchemaRDD → DataFrame API → Datasets (Spark version 1.0 → 1.3 → 1.6)
➤ 1.0: SchemaRDD (SparkSQL becomes part of the core Spark distribution)
➤ 1.3: DataFrame API (rename due to OO structure change)
➤ 1.6: Datasets (compile-time type safety, further optimization)
DATAFRAME API
➤ Idea borrowed from Python's pandas library
➤ single-node tabular data with APIs for math, stats, algebra, …
➤ Def:
➤ RDD + Schema
➤ RDDs with additional relational operators such as
➤ selecting required columns
➤ joining different data sources
➤ aggregation
➤ filtering
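The "RDD + Schema" definition can be illustrated with a toy class: rows are plain tuples (the RDD part) and a schema names the columns, which is exactly what enables relational operators like select/filter/aggregate. `MiniFrame` and its methods are invented for this demo; this mimics the concept only, not Spark's implementation.

```python
class MiniFrame:
    def __init__(self, schema, rows):
        self.schema = schema          # column names
        self.rows = rows              # list of tuples, like an RDD

    def _idx(self, col):
        return self.schema.index(col)

    def select(self, *cols):
        # Keep only the requested columns, in the requested order.
        idxs = [self._idx(c) for c in cols]
        return MiniFrame(list(cols),
                         [tuple(r[i] for i in idxs) for r in self.rows])

    def filter(self, pred):
        # pred sees each row as a dict keyed by column name.
        return MiniFrame(self.schema,
                         [r for r in self.rows
                          if pred(dict(zip(self.schema, r)))])

    def agg_sum(self, col):
        i = self._idx(col)
        return sum(r[i] for r in self.rows)

people = MiniFrame(["name", "age"], [("ann", 30), ("bob", 17), ("cat", 25)])
adults = people.filter(lambda row: row["age"] >= 18).select("name")
```

Without the schema, `select("name")` would be impossible: a bare RDD of tuples has no idea which position "name" is.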
DATAFRAME API
➤ Writing less code
DATAFRAME API
➤ Faster implementations
CATALYST OPTIMIZER
➤ Goal: convert a logical plan into physical plans
➤ Process:
➤ Logical plan is a tree representing data and schema
➤ The tree is manipulated and optimized by catalyst rules
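The core mechanic, a plan tree rewritten by rules, can be sketched in a few lines. The nested-tuple node encoding and the `combine_filters` rule are invented for illustration; Catalyst's real rules pattern-match on Scala case classes, but the bottom-up transform is the same shape.

```python
def transform(node, rule):
    """Apply a rewrite rule bottom-up over a nested-tuple plan tree."""
    if isinstance(node, tuple):
        node = tuple(transform(child, rule) for child in node)
    return rule(node)

def combine_filters(node):
    # Filter(p1, Filter(p2, child)) -> Filter(p1 AND p2, child)
    if (isinstance(node, tuple) and node[0] == "Filter"
            and isinstance(node[2], tuple) and node[2][0] == "Filter"):
        _, p1, (_, p2, child) = node
        return ("Filter", f"({p1}) AND ({p2})", child)
    return node

plan = ("Filter", "age > 21",
        ("Filter", "country = 'US'",
         ("Scan", "people")))
optimized = transform(plan, combine_filters)
```

A rule only describes one local rewrite; `transform` takes care of applying it everywhere in the tree, which is what makes adding new rules cheap.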
CATALYST OPTIMIZER
➤ An example query
CATALYST OPTIMIZER
➤ Native query planning
CATALYST OPTIMIZER
➤ Optimization Rules example
CATALYST OPTIMIZER
➤ Optimization Rules continued
➤ Allow defining customized rules
CATALYST OPTIMIZER
➤ Optimization rules
➤ Eliminate subqueries
➤ Constant folding
➤ Simplify filters
➤ Push predicates through filters (predicate pushdown)
➤ Project collapsing
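One rule from the list above, constant folding, is small enough to show in full: any subtree whose operands are all literals is evaluated once at planning time instead of per row. The `("+", left, right)` node encoding is invented for this demo.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def fold_constants(expr):
    if not isinstance(expr, tuple):
        return expr                    # a literal or a column name
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)    # both sides known: compute now
    return (op, left, right)

# x + (2 * 3)  ->  x + 6 : the multiplication never runs per row
folded = fold_constants(("+", "x", ("*", 2, 3)))
```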
SPARK SQL FUTURE
➤ Tungsten - Optimization for the next few years
➤ Begin with hardware trends
SPARK SQL FUTURE
➤ Tungsten - Optimization for the next few years
➤ Recall our SparkSQL workflow
SPARK SQL FUTURE
➤ Tungsten - Preparing Spark for the next 5 years
➤ Substantially speed up execution by optimizing CPU efficiency via
➤ Runtime code generation
➤ Exploiting cache locality
➤ Off-heap memory management
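The runtime code generation idea can be sketched as follows: instead of walking an expression tree for every row (interpretation), emit source for the expression once, compile it into a function, and call that per row. The tree encoding and helper names are invented; Tungsten generates JVM bytecode, but the principle is the same.

```python
def interpret(expr, row):
    """Walk the expression tree for every row (slow path)."""
    if isinstance(expr, str):
        return row[expr]
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    a, b = interpret(l, row), interpret(r, row)
    return a + b if op == "+" else a * b

def to_source(expr):
    """Emit Python source for the tree, folding the tree walk away."""
    if isinstance(expr, str):
        return f"row[{expr!r}]"
    if not isinstance(expr, tuple):
        return repr(expr)
    op, l, r = expr
    return f"({to_source(l)} {op} {to_source(r)})"

def compile_expr(expr):
    code = f"lambda row: {to_source(expr)}"
    return eval(code)  # generate the function once, reuse for every row

expr = ("+", ("*", "x", 2), 1)     # x * 2 + 1
fast = compile_expr(expr)
row = {"x": 10}
```

The generated function has no branches on node types and no recursive calls, which is where the CPU-efficiency win comes from on large row counts.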
CONCLUSION
➤ SparkSQL intuition
➤ SparkSQL pipeline/architecture
REFERENCES
➤ http://www.slideshare.net/datamantra/anatomy-of-data-frame-api
➤ http://www.slideshare.net/databricks/2015-0616-spark-summit
➤ http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
➤ http://www.slideshare.net/databricks/spark-sqlsse2015public
➤ http://www.slideshare.net/datamantra/introduction-to-structured-data-in-spark
➤ Databricks official blog
MY LEARNING WORKFLOW
➤ Recently I have gradually set up my workflow and shortened my learning cycle
➤ Everyone has different learning workflows
➤ But let’s share and make progress together
➤ Knowledge collecting: Evernote web clipper; organize notes by tags
➤ Summarization: MindManager; Keynote/PowerPoint; Evernote markdown docs with Marxico
➤ Sharing: Blog (Evernote + postach.io)
