Beyond SQL:
Speeding up Spark with DataFrames
Michael Armbrust - @michaelarmbrust
March 2015 – Spark Summit East
[Charts: # of Unique Contributors and # of Commits Per Month. Graduated from Alpha in 1.3.]
About Me and SQL

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)

Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, and Java

@michaelarmbrust
•  Lead developer of Spark SQL @databricks
The not-so-secret truth...
Spark SQL is not about SQL.
Execution Engine Performance

[Bar chart: TPC-DS Performance on queries 3, 7, 19, 27, 34, 42, 43, 46, 52, 53, 55, 59, 63, 68, 73, 79, 89, and 98, comparing Shark and Spark SQL]
The not-so-secret truth...
Spark SQL is about more than SQL.
Spark SQL: The whole story

Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
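For example, a minimal PySpark sketch, assuming an existing SparkContext sc and a hypothetical registered table "people":

    from pyspark.sql import SQLContext
    sqlCtx = SQLContext(sc)            # sc is an existing SparkContext
    people = sqlCtx.table("people")    # a DataFrame: rows organized into named columns
    people.printSchema()               # inspect the column names and types
    people.show(5)                     # display the first five rows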
	
  
Write Less Code: Input & Output

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Built-in and external data sources: { JSON }, JDBC, and more…]
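A minimal sketch of reading and writing through the Data Source API, using the 1.3-era load/save calls that appear later in this deck; the paths and formats are illustrative:

    # Assumes a SQLContext named sqlCtx; paths are hypothetical.
    events = sqlCtx.load("/data/events.json", "json")      # read JSON into a DataFrame
    events.save("/data/events_parquet", "parquet")         # write the same data back out as Parquet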
Write Less Code: High-Level Operations

Common operations can be expressed concisely as calls to the DataFrame API, as sketched below:
•  Selecting required columns
•  Joining different data sources
•  Aggregation (count, sum, average, etc.)
•  Filtering
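A minimal sketch combining these operations, assuming hypothetical "users" and "events" tables with user_id, age, and country columns:

    # Assumes a SQLContext named sqlCtx; table and column names are illustrative.
    from pyspark.sql.functions import avg

    users  = sqlCtx.table("users")
    events = sqlCtx.table("events")
    stats = (events
        .join(users, events.user_id == users.user_id)   # join different data sources
        .where(users.age > 21)                           # filter
        .select("country", "age")                        # select required columns
        .groupBy("country")                              # aggregate:
        .agg(avg("age")))                                #   average age per country
    stats.collect()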
Write Less Code: Compute an Average

Using Hadoop MapReduce (Java):

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(
        LongWritable key,
        Text value,
        Context context) {
      String[] fields = value.toString().split("\t");
      output.set(Integer.parseInt(fields[1]));
      context.write(one, output);
    }

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(
        IntWritable key,
        Iterable<IntWritable> values,
        Context context) {
      int sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      average.set(sum / (double) count);
      context.write(key, average);
    }

Using Spark RDDs (Python):

    data = sc.textFile(...).split("\t")
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()
Write Less Code: Compute an Average

Using RDDs:

    data = sc.textFile(...).split("\t")
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using DataFrames:

    sqlCtx.table("people") \
        .groupBy("name") \
        .agg("name", avg("age")) \
        .collect()

Full API Docs
•  Python
•  Scala
•  Java
Not Just Less Code: Faster Implementations

[Bar chart: Time to Aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
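For context, a sketch of the kind of aggregation being timed, in both RDD and DataFrame form; the pairs_rdd and pairs_df inputs are hypothetical, not the actual benchmark harness:

    # Sum the int pairs per key; assumes pairs_rdd is an RDD of (int, int) tuples
    # and pairs_df is an equivalent DataFrame with columns "a" and "b".
    rdd_result = pairs_rdd.reduceByKey(lambda x, y: x + y).collect()
    df_result  = pairs_df.groupBy("a").agg({"b": "sum"}).collect()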
Demo: Data Sources API

Using Spark SQL to read, write, and transform data in a variety of formats.
http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
Read Less Data

The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically, as sketched below:
•  Converting to more efficient formats
•  Using columnar formats (e.g., Parquet)
•  Using partitioning (e.g., /year=2014/month=02/…)¹
•  Skipping data using statistics (e.g., min, max)²
•  Pushing predicates into storage systems (e.g., JDBC)

¹ Only supported for Parquet and Hive; more support coming in Spark 1.4.
² Turned off by default in Spark 1.3.
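A minimal sketch of reading less data through partitioning, assuming events stored as Parquet under a hypothetical layout like /data/events/year=2014/month=02/…; the filters on the partition columns let Spark skip whole directories:

    # Assumes a SQLContext named sqlCtx; path and column names are illustrative.
    events = sqlCtx.load("/data/events", "parquet")
    feb_2014 = events.where(events.year == 2014).where(events.month == 2)
    feb_2014.select("user_id", "timestamp").collect()   # reads only the needed partitions and columns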
	
  
Plan Optimization & Execution

[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces Physical Plans, from which a Cost Model selects the Selected Physical Plan; Code Generation turns it into RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
Optimization happens as late as possible, so Spark SQL can optimize even across functions.
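One way to watch this happen is DataFrame.explain(); a minimal sketch, assuming a registered "users" table (the table and filter are illustrative):

    # Assumes a SQLContext named sqlCtx.
    df = sqlCtx.table("users").where("age > 21")
    df.explain(True)   # prints the logical, optimized, and physical plans chosen by Catalyst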
    def add_demographics(events):
        u = sqlCtx.table("users")                      # Load Hive table
        return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "json"))
    training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table; the join is expensive, so only the relevant users should be joined]
[Physical Plan: join of scan (events) with filter over scan (users)]
    def add_demographics(events):
        u = sqlCtx.table("users")                      # Load partitioned Hive table
        return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))
    training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table]
[Physical Plan: join of scan (events) with filter over scan (users)]
[Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users)]
Machine Learning Pipelines

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    df = sqlCtx.load("/path/to/data")
    model = pipeline.fit(df)

[Diagram: Pipeline Model: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr / lr.model → ds3]
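A short usage sketch of the fitted pipeline, assuming a hypothetical held-out DataFrame with the same "text" column:

    # `model` is the PipelineModel fitted above; the test path is illustrative.
    test_df = sqlCtx.load("/path/to/test_data")
    predictions = model.transform(test_df)              # applies tokenizer, hashingTF, then lr.model
    predictions.select("text", "prediction").show()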
  
Create and Run Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work

Questions?
