Agenda
● Brief Review of Spark (15 min)
● Intro to Spark SQL (30 min)
● Code session 1: Lab (45 min)
● Break (15 min)
● Intermediate Topics in Spark SQL (30 min)
● Code session 2: Quiz (30 min)
● Wrap up (15 min)
Spark Review
By Aaron Merlob
Apache Spark
● Open-source cluster computing framework
● “Successor” to Hadoop MapReduce
● Supports Scala, Java, and Python!
https://en.wikipedia.org/wiki/Apache_Spark
Spark Core + Libraries
https://spark.apache.org
Resilient Distributed Dataset
● Distributed Collection
● Fault-tolerant
● Parallel operation - Partitioned
● Many data sources
Implementation...
RDD - Main Abstraction
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred (Scala)
(How Good Is Aaron’s Presentation? Immutable, Lazily Evaluated, Cachable, Type Inferred.)
RDD Operations
Actions
Transformations
Cache & Persist
Transformed RDDs are recomputed on each action.
Store RDDs in memory using cache() (or persist()).
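A minimal sketch of the difference in the shell, assuming the AAPL.csv file used later in the workshop is available locally:
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("AAPL.csv")              // any input works; this path is just an example
val words = lines.flatMap(_.split(","))
words.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK)   // alternative: spill partitions to disk if memory is tight
words.count()   // first action computes the lineage and populates the cache
words.count()   // second action reuses the in-memory partitions instead of re-reading the file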
SparkContext
● Your way to get data into and out of RDDs
● Given as ‘sc’ when you launch the Spark shell
For example: sc.parallelize()
Transformation vs. Action?
val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()
Transformation vs. Action?
Same pipeline, but this time the result RDD is cached, so the second action reuses it instead of recomputing:
val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2).cache()
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()
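For reference, a minimal annotated version of the same pipeline, marking which calls are lazy transformations and which are actions (the annotations are mine, not from the slides):
val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))   // creates the base RDD (lazy)
val words = data.flatMap(d => d.split(" "))        // transformation (lazy)
val result = words.map(word => (word, 1)).         // transformation (lazy)
  reduceByKey((v1, v2) => v1 + v2).                // transformation (lazy, introduces a shuffle)
  cache()                                          // persistence hint only; nothing executes yet
result.filter( kv => kv._1.contains("a") ).count() // action: triggers the job and fills the cache
result.filter{ case (k, v) => v > 2 }.count()      // action: reuses the cached result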
Spark SQL
By Aaron Merlob
Spark SQL
RDDs with Schemas!
Schemas = Table Names + Column Names + Column Types = Metadata
Schemas
● Schema Pros
○ Enable column names instead of column positions
○ Queries using SQL (or DataFrame) syntax
○ Make your data more structured
● Schema Cons
○ Make your data more structured
○ Reduce future flexibility (app is more fragile)
○ Y2K
HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
FYI - a less preferred alternative: org.apache.spark.sql.SQLContext
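A short sketch of the setup in spark-shell (sc already exists there; the implicits import is an extra convenience, not on the slide):
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._   // enables rdd.toDF(...) and the $"column" syntax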
DataFrames
Primary abstraction in Spark SQL
Evolved from SchemaRDD
Exposes functionality via SQL or DF API
SQL for developer productivity (ETL, BI, etc)
DF for data scientist productivity (R / Pandas)
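A small sketch of the two equivalent styles on a made-up two-column DataFrame (names and ages are invented for illustration; assumes the sqlContext and implicits from the previous sketch):
val people = sc.parallelize(Seq(("Aaron", 34), ("Brian", 28), ("Charlie", 51))).toDF("name", "age")
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()   // SQL style
people.filter(people("age") > 30).select("name").show()           // DataFrame API style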
Live Coding - Spark-Shell
Maven Packages for CSV and Avro
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
spark-shell --packages $SPARK_PKGS
Live Coding - Loading CSV
val path = "AAPL.csv"
val df = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(path)
df.registerTempTable("stocks")
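Once the temp table is registered it can be queried right away; the column names below (Date, Close) are an assumption about what AAPL.csv contains:
df.printSchema()   // check what inferSchema produced
sqlContext.sql("SELECT Date, Close FROM stocks ORDER BY Close DESC LIMIT 5").show()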
Caching
If I run a query twice, how many times will the data be read from disk?
1. RDDs are lazy.
2. Therefore the data will be read twice.
3. Unless you cache the RDD, all transformations in its lineage will re-execute on each action.
Caching Tables
sqlContext.cacheTable("stocks")
Particularly useful when using Spark SQL to
explore data, and if your data is on S3.
sqlContext.uncacheTable("stocks")
Caching in SQL
SQL Command Speed
`CACHE TABLE sales;` Eagerly
`CACHE LAZY TABLE sales;` Lazily
`UNCACHE TABLE sales;` Eagerly
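The same commands can be issued from the shell against the stocks table registered earlier:
sqlContext.sql("CACHE TABLE stocks")                    // eager: scans and caches the table now
sqlContext.sql("SELECT COUNT(*) FROM stocks").show()    // served from the in-memory columnar cache
sqlContext.sql("UNCACHE TABLE stocks")                  // eager: drops the cached data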
Caching Comparison
Caching Spark SQL DataFrames vs
caching plain non-DataFrame RDDs
● RDDs cached at level of individual records
● DataFrames know more about the data.
● DataFrames are cached using an in-memory
columnar format.
Caching Comparison
What is the difference between these:
(a) sqlContext.cacheTable("df_table")
(b) df.cache
(c) sqlContext.sql("CACHE TABLE df_table")
Lab 1
Spark SQL Workshop
Spark SQL,
the Sequel
By Aaron Merlob
Live Coding - Filetype ETL
● Read in a CSV
● Export as JSON or Parquet
● Read JSON
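A sketch of that round trip, reusing the df loaded from CSV earlier; the output paths are made up:
df.write.parquet("aapl.parquet")           // columnar format, schema preserved
df.write.json("aapl.json")                 // one JSON object per line
val fromJson = sqlContext.read.json("aapl.json")
fromJson.printSchema()                     // schema is re-inferred from the JSON on read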
Live Coding - Common
● Show
● Sample
● Take
● First
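What those look like on the stocks DataFrame:
df.show(5)                                          // print the first 5 rows as a formatted table
df.sample(withReplacement = false, 0.10).show()     // ~10% random sample, returned as a new DataFrame
df.take(3)                                          // action: Array[Row] with the first 3 rows on the driver
df.first()                                          // action: just the first Row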
Read Formats
Format Read
Parquet sqlContext.read.parquet(path)
ORC sqlContext.read.orc(path)
JSON sqlContext.read.json(path)
CSV sqlContext.read.format("com.databricks.spark.csv").load(path)
Write Formats
Format Write
Parquet df.write.parquet(path)
ORC df.write.orc(path)
JSON df.write.json(path)
CSV df.write.format("com.databricks.spark.csv").save(path)
Schema Inference
Infer schema of JSON files:
● By default it scans the entire file.
● It finds the broadest type that will fit a field.
● This is a distributed RDD operation, so it runs in parallel across the cluster.
Infer schema of CSV files:
● The CSV parser uses the same logic as the JSON parser.
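A quick way to inspect what was inferred, using the df loaded from AAPL.csv earlier:
df.printSchema()              // shows each column with its inferred type and nullability
df.dtypes.foreach(println)    // same info as an Array of (columnName, typeName) pairs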
User Defined Functions
How do you apply a “UDF”?
● Import types (StringType, IntegerType, etc)
● Create UDF (in Scala)
● Apply the function (in SQL)
Notes:
● UDFs can take single or multiple arguments
● registerFunction takes an optional second argument: the return type
Live Coding - UDF
● Import types (StringType, IntegerType, etc)
● Create UDF (in Scala)
● Apply the function (in SQL)
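A sketch of those three steps in the shell, using the Scala udf.register entry point; the function body and the Date/Close columns are invented for illustration:
import org.apache.spark.sql.types._   // StringType, IntegerType, DoubleType, ...
// Register a UDF by name so it can be called from SQL (return type is inferred from the function here).
sqlContext.udf.register("yearOf", (date: String) => date.take(4))   // hypothetical logic
sqlContext.sql("SELECT yearOf(Date) AS year, Close FROM stocks LIMIT 5").show()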
Live Coding - Autocomplete
Use autocomplete to find all types available for SQL schemas and UDFs
Types and their meanings:
StringType = String
IntegerType = Int
DoubleType = Double
Spark UI on port 4040