SlideShare a Scribd company logo
SPARK SQL
Xinh Huynh
Women in Big Data training workshop
August, 2016
Audience poll
https://commons.wikimedia.org/wiki/File:PEO-happy_person_raising_one_hand.svg
Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
Why learn Spark SQL?
• Most popular component in Spark
• Spark Survey 2015
• Use cases
• ETL
• Analytics
• Feature Extraction for machine learning
% of users
0 18 35 53 70
Spark SQL
DataFrames
MLlib, GraphX
Streaming
Use case: ETL & analytics
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Analytics
• What time of day do users use the app?
• What is the most popular restaurant type in San Jose, CA?
Logs ETL Analytics
Spark SQL Spark SQL
How Spark SQL fits into Spark (2.0)
Spark Core (RDD)
Catalyst
SQL DataFrame / Dataset
ML Pipelines
Structured
Streaming
GraphFrames
Spark SQL
http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
Spark SQL programming interfaces
Catalyst
SQL DataFrame / Dataset
Spark SQL
SQL Scala, Java, R, Python Scala, Java
SQL or DataFrame?
• Use SQL if you are already familiar with SQL
• Use DataFrame
• To write queries in a general-purpose programming language
(Scala, Python, …).
• Use DataFrame to catch syntax errors earlier:
SQL DataFrame
Syntax Error
Example
“SELEECT id FROM table” df.seleect(“id”)
Caught at Runtime Compile Time
Loading and examining a table, Query with SQL
• See Notebook: http://tinyurl.com/spark-nb1
Setup for Hands-on Training
1. Sign on to WiFi with your assigned access code
1. See slip of paper in front of your seat
2. Sign in to https://community.cloud.databricks.com/
3. Go to "Clusters" and create a Spark 2.0 cluster
1. This may take a minute.
4. Go to “Workspace” -> Users -> Home -> Create ->
Notebook
1. Select Language = Scala
2. Create
Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
DataFrame API
• See notebook: http://tinyurl.com/spark-nb2
Lazy Execution
• DataFrame operations are lazy
• Work is delayed until the last possible moment
• Transformations: DF -> DF
• select, groupBy; no computation done
• Actions: DF -> console or disk output
• show, collect, count, write; computation is done
https://www.flickr.com/photos/mtch3l/24491625352
Lazy Execution Example
1. val df1 = df.select(…)
2. val df2 = df1.groupBy (…)
3. .sum()
4. if (cond)
5. df2.show()
• Benefits of laziness
• Query optimization across lines 1-3
• If step 5 is not executed, then no unnecessary work was done
Transformation: no
computation done
Transformation: no
computation done
Action: performs the
select, groupBy at this
time, then shows the
results
Caching
• When querying the same data set over and over, caching it
in memory may speed up queries.
• Back to notebook …
Disk Memory Results
Memory Results
Without
caching:
With
caching:
Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
Use case: Feature Extraction for ML
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Machine Learning to train a model of user preferences
• Use Spark SQL to extract features for the model
• Example features: hour of day, distance to a restaurant, restaurant
type
Logs ETL Features ML Training
Spark SQL Spark SQL
See Notebook …
Functions for DataFrames
• See notebook: http://tinyurl.com/spark-nb3
Dataset (new in 2.0)
• DataFrames are untyped
• df.select($”col1” + 3)
• Useful when exploring new data
• Datasets are typed
• Dataset[T]
• Associates an object of type T with each row
• Catches type mismatches at compile time
• DataFrame = Dataset[Row]
• A DataFrame is one specific type of Dataset[T]
case class FarmersMarket(FMID: Int, MarketName: String)
val ds : Dataset[FarmersMarket] …
Numerical type assumed, but
not checked at compile time
Review
• Part 1: Spark SQL Overview, SQL Queries √
• Part 2: DataFrame Queries √
• Part 3: Additional DataFrame Functions √
References
• Spark SQL: http://spark.apache.org/docs/latest/sql-
programming-guide.html
• Spark Scala API docs: http://spark.apache.org/docs/latest/
api/scala/index.html#org.apache.spark.package
• Overview of DataFrames: http://
xinhstechblog.blogspot.com/2016/05/overview-of-spark-
dataframe-api.html
• Questions, comments:
• Spark user list: user@spark.apache.org
• Xinh’s contact: https://www.linkedin.com/in/xinh-huynh-317608
• Women in Big Data: https://www.womeninbigdata.org/

More Related Content

What's hot

New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Spark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
Spark sql
Spark sqlSpark sql
Spark sql
Zahra Eskandari
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Databricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
Kristian Alexander
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Databricks
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 

What's hot (20)

New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Spark etl
Spark etlSpark etl
Spark etl
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 

Similar to Introduction to Spark SQL training workshop

Visibility-from web application interface to the database
Visibility-from web application interface to the databaseVisibility-from web application interface to the database
Visibility-from web application interface to the database
ManageEngine, Zoho Corporation
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
sqlserver.co.il
 
Key to optimal end user experience
Key to optimal end user experienceKey to optimal end user experience
Key to optimal end user experience
ManageEngine, Zoho Corporation
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
SPC Adriatics
 
CCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
CCI2019 - Monitorare SQL Server Senza Andare in BancarottaCCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
CCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
walk2talk srl
 
Observability with Spring-based distributed systems
Observability with Spring-based distributed systemsObservability with Spring-based distributed systems
Observability with Spring-based distributed systems
Rakuten Group, Inc.
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
Treasure Data, Inc.
 
REST Api Tips and Tricks
REST Api Tips and TricksREST Api Tips and Tricks
REST Api Tips and Tricks
Maksym Bruner
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
Splunk
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at Parse
Travis Redman
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at Parse
MongoDB
 
Server and application monitoring webinars [Applications Manager] - Part 2
Server and application monitoring webinars [Applications Manager] - Part 2Server and application monitoring webinars [Applications Manager] - Part 2
Server and application monitoring webinars [Applications Manager] - Part 2
ManageEngine, Zoho Corporation
 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
Jungsu Heo
 
Server and application monitoring webinars [Applications Manager]: Part 1
Server and application monitoring webinars [Applications Manager]: Part 1Server and application monitoring webinars [Applications Manager]: Part 1
Server and application monitoring webinars [Applications Manager]: Part 1
ManageEngine, Zoho Corporation
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
Petter Skodvin-Hvammen
 
Building high performance and scalable share point applications
Building high performance and scalable share point applicationsBuilding high performance and scalable share point applications
Building high performance and scalable share point applications
Talbott Crowell
 
CCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysisCCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysis
walk2talk srl
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs
 

Similar to Introduction to Spark SQL training workshop (20)

Visibility-from web application interface to the database
Visibility-from web application interface to the databaseVisibility-from web application interface to the database
Visibility-from web application interface to the database
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Key to optimal end user experience
Key to optimal end user experienceKey to optimal end user experience
Key to optimal end user experience
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
 
CCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
CCI2019 - Monitorare SQL Server Senza Andare in BancarottaCCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
CCI2019 - Monitorare SQL Server Senza Andare in Bancarotta
 
Observability with Spring-based distributed systems
Observability with Spring-based distributed systemsObservability with Spring-based distributed systems
Observability with Spring-based distributed systems
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
REST Api Tips and Tricks
REST Api Tips and TricksREST Api Tips and Tricks
REST Api Tips and Tricks
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at Parse
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at Parse
 
Server and application monitoring webinars [Applications Manager] - Part 2
Server and application monitoring webinars [Applications Manager] - Part 2Server and application monitoring webinars [Applications Manager] - Part 2
Server and application monitoring webinars [Applications Manager] - Part 2
 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
 
Server and application monitoring webinars [Applications Manager]: Part 1
Server and application monitoring webinars [Applications Manager]: Part 1Server and application monitoring webinars [Applications Manager]: Part 1
Server and application monitoring webinars [Applications Manager]: Part 1
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
Building high performance and scalable share point applications
Building high performance and scalable share point applicationsBuilding high performance and scalable share point applications
Building high performance and scalable share point applications
 
CCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysisCCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysis
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 

Recently uploaded

2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Envertis Software Solutions
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 

Recently uploaded (20)

2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 

Introduction to Spark SQL training workshop

  • 1. SPARK SQL Xinh Huynh Women in Big Data training workshop August, 2016
  • 3. Outline • Part 1: Spark SQL Overview, SQL Queries • Part 2: DataFrame Queries • Part 3: Additional DataFrame Functions
  • 4. Outline • Part 1: Spark SQL Overview, SQL Queries • Part 2: DataFrame Queries • Part 3: Additional DataFrame Functions
  • 5. Why learn Spark SQL? • Most popular component in Spark • Spark Survey 2015 • Use cases • ETL • Analytics • Feature Extraction for machine learning % of users 0 18 35 53 70 Spark SQL DataFrames MLlib, GraphX Streaming
  • 6. Use case: ETL & analytics • Example: restaurant finder app • Log data: Timestamp, UserID, Location, RestaurantType • [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ] • Analytics • What time of day do users use the app? • What is the most popular restaurant type in San Jose, CA? Logs ETL Analytics Spark SQL Spark SQL
  • 7. How Spark SQL fits into Spark (2.0) Spark Core (RDD) Catalyst SQL DataFrame / Dataset ML Pipelines Structured Streaming GraphFrames Spark SQL http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
  • 8. Spark SQL programming interfaces Catalyst SQL DataFrame / Dataset Spark SQL SQL Scala, Java, R, Python Scala, Java
  • 9. SQL or DataFrame? • Use SQL if you are already familiar with SQL • Use DataFrame • To write queries in a general-purpose programming language (Scala, Python, …). • Use DataFrame to catch syntax errors earlier: SQL DataFrame Syntax Error Example “SELEECT id FROM table” df.seleect(“id”) Caught at Runtime Compile Time
  • 10. Loading and examining a table, Query with SQL • See Notebook: http://tinyurl.com/spark-nb1
  • 11. Setup for Hands-on Training 1. Sign on to WiFi with your assigned access code 1. See slip of paper in front of your seat 2. Sign in to https://community.cloud.databricks.com/ 3. Go to "Clusters" and create a Spark 2.0 cluster 1. This may take a minute. 4. Go to “Workspace” -> Users -> Home -> Create -> Notebook 1. Select Language = Scala 2. Create
  • 12. Outline • Part 1: Spark SQL Overview, SQL Queries • Part 2: DataFrame Queries • Part 3: Additional DataFrame Functions
  • 13. DataFrame API • See notebook: http://tinyurl.com/spark-nb2
  • 14. Lazy Execution • DataFrame operations are lazy • Work is delayed until the last possible moment • Transformations: DF -> DF • select, groupBy; no computation done • Actions: DF -> console or disk output • show, collect, count, write; computation is done https://www.flickr.com/photos/mtch3l/24491625352
  • 15. Lazy Execution Example 1. val df1 = df.select(…) 2. val df2 = df1.groupBy (…) 3. .sum() 4. if (cond) 5. df2.show() • Benefits of laziness • Query optimization across lines 1-3 • If step 5 is not executed, then no unnecessary work was done Transformation: no computation done Transformation: no computation done Action: performs the select, groupBy at this time, then shows the results
  • 16. Caching • When querying the same data set over and over, caching it in memory may speed up queries. • Back to notebook … Disk Memory Results Memory Results Without caching: With caching:
  • 17. Outline • Part 1: Spark SQL Overview, SQL Queries • Part 2: DataFrame Queries • Part 3: Additional DataFrame Functions
  • 18. Use case: Feature Extraction for ML • Example: restaurant finder app • Log data: Timestamp, UserID, Location, RestaurantType • [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ] • Machine Learning to train a model of user preferences • Use Spark SQL to extract features for the model • Example features: hour of day, distance to a restaurant, restaurant type Logs ETL Features ML Training Spark SQL Spark SQL See Notebook …
  • 19. Functions for DataFrames • See notebook: http://tinyurl.com/spark-nb3
  • 20. Dataset (new in 2.0) • DataFrames are untyped • df.select($”col1” + 3) • Useful when exploring new data • Datasets are typed • Dataset[T] • Associates an object of type T with each row • Catches type mismatches at compile time • DataFrame = Dataset[Row] • A DataFrame is one specific type of Dataset[T] case class FarmersMarket(FMID: Int, MarketName: String) val ds : Dataset[FarmersMarket] … Numerical type assumed, but not checked at compile time
  • 21. Review • Part 1: Spark SQL Overview, SQL Queries √ • Part 2: DataFrame Queries √ • Part 3: Additional DataFrame Functions √
  • 22. References • Spark SQL: http://spark.apache.org/docs/latest/sql- programming-guide.html • Spark Scala API docs: http://spark.apache.org/docs/latest/ api/scala/index.html#org.apache.spark.package • Overview of DataFrames: http:// xinhstechblog.blogspot.com/2016/05/overview-of-spark- dataframe-api.html • Questions, comments: • Spark user list: user@spark.apache.org • Xinh’s contact: https://www.linkedin.com/in/xinh-huynh-317608 • Women in Big Data: https://www.womeninbigdata.org/